Searching for Semantics: Data
Mining, Reverse Engineering 

Stefano Spaccapietra Fred Maryanski 

Swiss Federal Institute of Technology University of Connecticut 

Lausanne, Switzerland Storrs, CT, USA 



REVIEW AND FUTURE DIRECTIONS 

In the last few years, database semantics research has turned sharply from a 
highly theoretical domain to one with more focus on practical aspects. The DS- 
7 Working Conference held in October 1997 in Leysin, Switzerland, demon- 
strated the more pragmatic orientation of the current generation of leading 
researchers. The papers presented at the meeting emphasized two major
areas: the discovery of semantics and semantic data modeling.

The work in the latter category indicates that although object-oriented 
database management systems have emerged as commercially viable prod- 
ucts, many fundamental modeling issues require further investigation. Today’s 
object-oriented systems provide the capability to describe complex objects 
and include techniques for mapping from a relational database to objects. 
However, we must further explore the expression of information regarding 
the dimensions of time and space. Semantic models possess the richness to 
describe systems containing spatial and temporal data. The challenge of in- 
corporating these features in a manner that promotes efficient manipulation 
by the subject specialist still requires extensive development. 

The papers at DS-7 highlighted important progress on the discovery of 
semantics. Several authors discussed techniques for inferring information on 
relationships among elements. The nature of these relationships varies from 
identifying elements with corresponding properties and meaning, to the deriva- 
tion of statistical properties of the database. This category of papers includes 
the keynote address, 

• OLAP Mining: An Integration of OLAP with Data Mining, by Jiawei Han, 
and the following submitted papers: 

• CE: The Classifier-Estimator Framework for Data Mining, by Dalkilic, Van 
Gucht, and Robertson. 

• Contribution to the Reverse Engineering of OO Applications Methodology
and Case Study, by Hainaut et al. 

• The Reengineering of Relational Databases Based on Key and Data Cor- 
relations, by Tari et al. 

• Discovering and Reconciling Semantic Conflicts: A Data Mining Perspec- 
tive, by Lu et al. 







Other researchers have focussed upon the extraction of constraints or ba- 
sic data properties which govern the behavior of all database objects in a 
schema. The results presented at the meeting represent impressive advances 
in the discovery of database semantics, but most projects require additional
investigation to ensure that the information produced by the systems is fully
understood by the database designers and users and that a complete range of
useful information is provided. This group of papers includes

• Experience with a Combined Approach to Attribute-Matching across Het- 
erogeneous Databases, by Clifton, Housman, and Rosenthal. 

• Searching for Semantics in COBOL Legacy Applications, by Andersson. 

• TopiCA: A Semantic Framework for Landscaping the Information Space in 
Federated Digital Libraries, by Papazoglou, Weigand, and Milliner. 

• Managing Constraint Violations in Administrative Information Systems, 
by Boydens, Pirotte, and Zimanyi. 

• Automatic Generation of Update Rules to Enforce Consistency Constraints 
in Design Databases, by Browner et al. 

• Improvements in Supervised BRAINNE: A Method for Symbolic Data Min- 
ing Using Neural Networks, by Dillon et al. 

• A Formalization of ODMG Queries, by Riedel and Scholl. 

The above database semantic discovery approaches were complemented by 
the invited tutorial on text mining given by Martin Rajman: 

• Text Mining: Natural Language Techniques and Text Mining Applications, 
by Rajman and Besançon

One of the more interesting directions reported at DS-7 is the emergence of 
the World Wide Web as the object of data mining investigations. The move 
from highly structured database tables with rich meta data to loosely, even in- 
consistently, structured web pages represents an important step toward broad 
utilization of data mining. The proposals for mining data, or even semantics, 
from the Web make very different assumptions regarding the level and type 
of human interaction. In some situations, the projects are utilizing web pages 
in their native form, others involve only certain HTML constructs, while a 
third set proposes meta data structures for Web pages. Successfully exporting 
data mining and reverse engineering methodologies from the structured world 
of databases to the virtually wide open web environment would be a signifi- 
cant breakthrough that would enhance the importance of database semantics 
research. The invited talk given by Letizia Tanca: 

• Semantic Approaches to Structuring and Querying Web Sites, by Damiani 
and Tanca, 



focussed on semantics of Web applications, as well as the following submitted 
papers: 
• Incorporating Generalized Quantifiers into Description Logic for
Representing Data Source Contents, by Tu and Madnick.

• An Associative Search Method Based on Symbolic Filtering and Semantic 
Ordering for Database Systems, by Yoshida, Kiyoki, and Kitagawa. 

• OFAHIR: “On-the-Fly” Automatic Authoring of Hypertexts for Informa- 
tion Retrieval, by Agosti, Benfante, and Melucci. 

Although object-oriented database system products have established a note- 
worthy commercial niche, they function for the most part as relational exten- 
sions. A general methodology for embedding descriptive semantics beyond 
those of rich data types has not yet emerged. Although the problems of man- 
aging temporal data are well defined, effective operation in a temporal domain 
requires more than data typing as it impacts the fundamental constructs of 
the system software. We need further work to integrate all aspects of process- 
ing temporal, spatial, and other dimensions of data into the next generation 
of systems. Two papers which present some fundamental advances for these 
new systems are: 

• Modeling and Implementation of Temporal Aspects of Objects in Commer- 
cial Information Systems, by Gruhn, Feger, and Wever. 

• Design Patterns for Spatio-Temporal Processes, by Claramunt, Parent, and 
Theriault. 

Overall, DS-7 demonstrated the steady progress of semantics research and 
opened the door to some tantalizing future discoveries. We hope the readers
of this book will share our interest in its content.

We would like to express our deepest thanks to the people that made the 
DS-7 conference a success: 

• the invited speakers, Jiawei Han, Letizia Tanca and Martin Rajman, and 
the authors of the accepted papers for giving us the opportunity to offer a 
high quality program, 

• the authors of the papers that we could not accept, whom we hope to meet 
at the DS-8 conference, 

• the general chairs, Stuart Madnick and Alain Pirotte, for devoting their 
precious time to the conference, 

• the organizing committee chair, Esteban Zimanyi, whose relentless
commitment paved the way to the satisfaction of the participants and made
it possible to organize this book without delay,

• the conference secretariat, Chiara Giammarco and Marlyse Taric, for effi- 
ciently materializing all of the needed arrangements. 
The Working Group provides an international forum for the exchange of
information on technical, economic, and social impacts and experiences with
databases and database applications. The Working Group was founded in
1974.

AIMS 

• For the benefit of society, to promote visibility and to increase the impact 
of research and development in the database area, especially in the fields 
defined in the scope of the working group. 

• To promote quality and relevance of academic and industrial research and 
development in the database area. 

• To promote ethical behavior and appropriate recommendations or guide- 
lines for research related activities, for example, submission and selection of 
publications, organization of conferences, allocation of grants and awards, 
and evaluation of professional merits and curricula. 

• To promote cooperation between researchers and with other established 
bodies and organizations pursuing the above aims. 

• To contribute to assessing the scientific merits and practical relevance of 
proposed approaches for data and knowledge management. 



SCOPE 

The notion of database has evolved to include systems that represent, store 
and enable manipulation of data, information and knowledge in a wide spec- 
trum of forms: ranging from tuples to rules, text, images, sounds and others, 
with their corresponding operators, usage and management. 

The group’s interest spans formalisms, models, architectures, techniques
and methodologies for the purpose of designing and realizing such
database systems. These currently include in particular:

• new models, languages and theories for database design and representation 

• new architectures and techniques, e.g. data warehouses, data mining, mul- 
timedia and spatio-temporal databases 

• impact of new communication technologies, such as Internet, broadband 
networks or wireless communications 

• understanding, reuse and interoperation of existing data stores 

• visual user interfaces and information visualization 

• new methodologies for building database applications 
ACTIVITIES

Regular activities of the working group include meetings and working
conferences. Other types of activities are organized on an ad hoc basis.



MEETINGS 

The group usually meets two to three times a year, in different places. The 
major part in meetings is devoted to research presentations and discussions by 
the participants. This scientific part may be focused on a specific field, or left 
entirely open to all topics of interest. Meetings include an administrative part, 
where decisions about appropriate policies to achieve the group’s goals are 
taken and all matters related to implementation of these policies are discussed. 



WORKING CONFERENCES 



The organization of working conferences is an IFIP preferred means to
promote scientific advances. A working conference is characterized by a focus
on quality rather than quantity of presentations, ample time for
presentation/discussion of contributions, a single stream of sessions,
opportunity for improvised break-outs, limited attendance and a pleasant
working environment to promote informal discussions and interchange of
ideas. WG 2.6 is running two
successful series of working conferences, DS and VDB. 



• Database Semantics (DS) conferences focus on semantic issues in various 
respective fields of interest to the group. Seven conferences have been held 
since 1985. Proceedings have been published by either North-Holland or 
Chapman & Hall. The next upcoming conference will focus on semantics 
of multimedia databases (DS-8, New Zealand, January 1999). 

• Visual Database Systems (VDB) conferences are devoted to advances in 
visual interfaces to database systems, information visualization and man- 
agement of visual data (images, videos, maps, ...). Three conferences have 
been held since 1991. VDB-3 proceedings have been published by Chapman 
& Hall. VDB-4 will be held in Italy, May 1998. VDB-5 is expected to be 
held in Fukuoka, Japan, in the year 2000. 
MEMBERSHIP

Membership in the working group is on an individual basis and implies a
commitment to active participation in presentations and discussions at
the meetings, and preferably participation in working conferences.
Membership is by invitation. Potential new members are invited to participate
as observers in three of the group’s meetings. This allows both parties to
evaluate the interest of the new membership. After three meetings, the group
may decide to invite the observer to become a member. The proposal for
membership then proceeds to the parent committee (IFIP TC2) for formal
approval. The membership term is unlimited, but membership may be dropped by
the group if a member does not participate in three consecutive meetings.
Persons may also be invited to participate in a single meeting. Persons
interested in becoming an observer may contact the chairperson of the
working group.



FURTHER INFORMATION 

Please check the IFIP 2.6 web site at
http://hermes.informatik.uni-ulm.de:80/dbis/IFIP-WG2.6/
OLAP Mining: An Integration
of OLAP with Data Mining 

Jiawei Han 

Intelligent Database Systems Research Laboratory 

School of Computing Science, Simon Fraser University, British Columbia, Canada 
E-mail: han@cs.sfu.ca 



Abstract 

OLAP mining is a mechanism which integrates on-line analytical processing
(OLAP) with data mining so that mining can be performed in different
portions of databases or data warehouses and at different levels of
abstraction at the user’s fingertips. With the rapid development of data
warehouse and OLAP technologies in the database industry, it is promising to
develop OLAP mining mechanisms.

With our years of research into data mining, an OLAP-based data min- 
ing system, DBMiner, has been developed, where OLAP mining is not only 
for data characterization but also for other data mining functions, including 
association, classification, prediction, clustering, and sequencing. Such an in- 
tegration increases the flexibility of mining and helps users find desired knowl- 
edge. In this paper, we introduce the concept of OLAP mining and discuss 
how OLAP mining should be implemented in a data mining system. 



1 INTRODUCTION 

With an enormous amount of data stored in databases and data warehouses, 
it is increasingly important to develop powerful data warehousing and data 
mining tools for analysis of this collected data and mining interesting knowl- 
edge from it (Fayyad, Piatetsky-Shapiro, Smyth & Uthurusamy 1996). 

Among many different designs and architectures of data mining systems, 
OLAP mining, which integrates on-line analytical processing (OLAP) with 
data mining, is a promising direction based on the following reasoning. 



1. Data mining tools need to work on integrated, consistent, and cleaned
data, which often requires data cleaning and data integration as
preprocessing steps (Fayyad et al. 1996). A data warehouse constructed by such
preprocessing serves as a valuable source of cleaned and integrated data for 
on-line analytical processing as well as for data mining. 

Data Mining and Reverse Engineering S. Spaccapietra & F. Maryanski (Eds.) 
2. OLAP mining facilitates interactive exploratory data analysis. Users
often like to traverse flexibly through a database, select any portions of
relevant data, analyze data at different levels of abstraction, and present 
knowledge/results in different forms. OLAP mining provides such a tool 
for drilling, pivoting, filtering, dicing and slicing on any sets of data in 
data cubes, for analyzing data at different levels of abstraction, and for 
interacting flexibly with the mining engine based on intermediate mining 
results. 

3. It is often difficult for users to predict beforehand what kinds of
knowledge should be mined. By integrating OLAP with multiple data mining modules,
OLAP mining provides flexibility for users to select desired data mining 
functions and swap data mining tasks dynamically. 



The above observations motivate us to study the desired ways to perform 
OLAP mining and their efficient implementation methods. 

With our years of research and development, an OLAP data mining system, 
DBMiner, has been developed by integration of database, OLAP and data min- 
ing technologies (Han & Fu 1995, Han & Fu 1996, Han, Chiang, Chee, Chen, 
Chen, Cheng, Gong, Kamber, Liu, Koperski, Lu, Stefanovic, Winstone, Xia, 
Zaiane, Zhang & Zhu 1997). The system mines various kinds of knowledge at 
multiple levels of abstraction from large relational databases and data ware- 
houses efficiently and effectively. In this paper, we examine the principles of 
OLAP mining and study its implementation techniques with the DBMiner 
system as a running example. 

The remainder of the paper is organized as follows. Section 2 is a brief
introduction to OLAP and data mining technologies. In Section 3, we examine 
OLAP mining and the desired OLAP mining functions. In Section 4, methods 
for efficient implementation of OLAP mining are studied. In Section 5, we 
summarize the study and propose some future research topics. 



2 OLAP AND DATA MINING 

To understand what OLAP mining is, we first need to understand what
OLAP and data mining are.

OLAP (On-Line Analytical Processing) refers to a set of data analysis
techniques developed since the 1990s for analyzing data in data warehouses. A
data warehouse stores a large collection of subject-oriented, integrated, time- 
variant, and nonvolatile data in support of management’s decision-making 
process. It presents a multidimensional, logical view of the data, and is hence 
called a multidimensional database or data cube. A point in a data cube stores 
a consolidated measure of the corresponding dimension values in a
multidimensional space. OLAP operations, such as drill-down, roll-up, pivot,
slice, dice, etc., are the ways to interact with the data cube for
multidimensional data analysis.

Figure 1 A 3-D data cube of a market data warehouse



Example 1 A market data warehouse shown in Figure 1 consists of three
dimensions: store, item, and time, and two measures, number_of_units_sold
and profit. A concept hierarchy is associated with each dimension as follows,

store(store_id, street, city, province, country)
item(item_id, item_name, brand, category)
time(minute, hour, day, month, quarter, year).

Drill-down or roll-up operations can be performed along each dimension.
For example, one may start with a low-level cube which consists of store_id,
item_name, and hour, and roll up to examine the number of items sold by
category, by city, and by quarter and then drill down to see the number of
items sold by item name, by store, and by month.

In addition to performing drill-down and roll-up operations, there are
several other popularly used OLAP operations: slicing, dicing and pivoting.
Slicing is the extraction, from a data cube, of summarized data for a given
dimension-value, or slice. For example, one may slice on a particular store
to examine the total number of items sold by item name and by day. Dicing
is the extraction of a “subcube”, or intersection of several slices of the data
cube. For example, to examine various kinds of TVs sold in Sears in August
1997, one needs to dice using several constants or range values. Filtering
performs selection on a data cube using some constants. Finally, pivoting
rotates the axes of a data cube so that one may examine a cube from different
angles.

The data cube can be browsed conveniently using the DBMiner cube browser
as shown in Figure 2, where the size of a cuboid represents the entry count
in the corresponding cell, and the brightness of a cuboid represents the
accumulated amount in the cell. Drilling and slicing/dicing operations can
also be performed on the data cube with simple button clicking.

Figure 2 Browsing of a 3-dimensional data cube in DBMiner

Moreover, OLAP operations in some systems include comprehensive statis- 
tical analysis packages, such as trend analysis, ratios and ranking, charting, 
browsing and surfing, iterative analysis, linear and non-linear modeling, re- 
gression analysis, time-series analysis, and multidimensional or complex cor- 
relation analysis. □ 
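
The cubing operations of Example 1 can be sketched on a toy fact table. The Python fragment below is only an illustrative sketch under stated assumptions, not the DBMiner implementation: the sample rows, the one-step hierarchy mappings, and the function names `roll_up`, `slice_cube`, and `dice` are all hypothetical.

```python
from collections import defaultdict

# Toy fact rows: (store_id, item_name, month, units_sold). Hypothetical data.
FACTS = [
    ("s1", "tv_a", "1997-07", 3),
    ("s1", "tv_b", "1997-08", 5),
    ("s2", "tv_a", "1997-08", 2),
    ("s2", "radio_x", "1997-08", 7),
]

# One step of each concept hierarchy (assumed mappings).
CITY = {"s1": "Vancouver", "s2": "Burnaby"}
CATEGORY = {"tv_a": "TV", "tv_b": "TV", "radio_x": "Radio"}
QUARTER = {"1997-07": "1997-Q3", "1997-08": "1997-Q3"}


def roll_up(facts, store_map, item_map, time_map):
    """Replace each dimension value by its parent and re-aggregate units sold."""
    cube = defaultdict(int)
    for store, item, month, units in facts:
        cube[(store_map[store], item_map[item], time_map[month])] += units
    return dict(cube)


def slice_cube(facts, store):
    """Slice: keep only the cells for one value of the store dimension."""
    return [f for f in facts if f[0] == store]


def dice(facts, stores, items, months):
    """Dice: intersect selections on several dimensions to obtain a subcube."""
    return [f for f in facts if f[0] in stores and f[1] in items and f[2] in months]


by_city_category_quarter = roll_up(FACTS, CITY, CATEGORY, QUARTER)
tvs_in_august = dice(FACTS, {"s1", "s2"}, {"tv_a", "tv_b"}, {"1997-08"})
```

Rolling up here only re-aggregates the measure along the hierarchies; a real OLAP engine would typically precompute and index such cuboids rather than rescan the fact rows.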



To facilitate our discussion of OLAP mining, the popularly used cube trans- 
formation operations, including drilling, rolling, slicing, dicing, filtering, and 
pivoting, are called cubing operations because they lead to the generation 
of new data cubes. Efficient computation of data cubes and efficient imple- 
mentation of OLAP operations have been investigated with a spectrum of 
techniques proposed, such as multiway aggregation of multidimensional arrays 
(Agarwal, Agrawal, Deshpande, Gupta, Naughton, Ramakrishnan & Sarawagi
1996, Zhao, Deshpande & Naughton 1997), indexing data cubes (Sarawagi 
1997), efficient OLAP computations, etc. These techniques are also essential 
for efficient implementation of OLAP mining. 

Data mining is the discovery of nontrivial and interesting knowledge or
patterns from the data stored in large databases. It consists of several major
functions as follows.
• Characterization which generalizes a set of task-relevant data into a
generalized data cube which can then be used for extraction of different kinds of
rules or be viewed at multiple levels of abstraction from different angles. In 
particular, it derives a set of characteristic rules which summarizes the gen- 
eral characteristics of a set of user-specified data (called the target class). 
For example, the symptoms of a specific disease can be summarized by a 
characteristic rule. 

• Comparison which mines a set of discriminant rules which summarize the 
features that distinguish the class being examined (the target class) from 
other classes (called contrasting classes). For example, to distinguish one 
disease from others, a discriminant rule summarizes the symptoms that 
discriminate this disease from others. 

• Classification which analyzes a set of training data (i.e., a set of objects 
whose class label is known) and constructs a model for each class based 
on the features in the data. A set of classification rules is generated by 
such a classification process, which can be used to classify future data and 
develop a better understanding of each class in the database. For example, 
one may classify diseases and provide the symptoms which describe each 
class or subclass. 

• Association which discovers a set of association rules (in the form of
“$A_1 \wedge \cdots \wedge A_i \rightarrow B_1 \wedge \cdots \wedge B_j$”) at
multiple levels of abstraction from the relevant set(s) of data in a
database. For example, one may discover a set
of symptoms often occurring together with certain kinds of diseases and 
further study the reasons behind them. 

• Prediction which predicts the possible values of some missing data or the 
value distribution of certain attributes in a set of objects. This involves 
finding the set of attributes relevant to the attribute of interest (by some 
statistical analysis) and predicting the value distribution based on the set of 
data similar to the selected object(s). For example, an employee’s potential 
salary can be predicted based on the salary distribution of similar employees 
in the company. 

• Cluster analysis which groups a selected set of data in the database or data
warehouse into a set of clusters so that the interclass similarity is low and
the intraclass similarity is high. For example, one may cluster the houses in
the Vancouver area according to house type, value, and geographical location.

• Time-series analysis which performs data analyses for time-related data 
in databases or data warehouses, including similarity analysis, periodic- 
ity analysis, sequential pattern analysis, and trend and deviation analysis. 
For example, one may find the general characteristics of the companies 
whose stock price has gone up over 20% last year or evaluate the trend or 
particular growth patterns of certain stocks. 
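
To make the association function above more concrete, the following sketch computes the support and confidence of a candidate rule over a handful of records. Support and confidence are the standard measures used in association rule mining; the records and the helper names `support` and `confidence` are hypothetical and chosen only for illustration.

```python
# Hypothetical patient records: each is a set of observed symptoms and diagnoses.
RECORDS = [
    {"fever", "cough", "flu"},
    {"fever", "cough", "flu"},
    {"fever", "rash"},
    {"cough", "flu"},
]


def support(records, itemset):
    """Fraction of records that contain every item in itemset."""
    return sum(itemset <= record for record in records) / len(records)


def confidence(records, antecedent, consequent):
    """Estimate of P(consequent | antecedent) for the rule antecedent -> consequent."""
    return support(records, antecedent | consequent) / support(records, antecedent)


# Candidate rule: fever AND cough -> flu
rule_support = support(RECORDS, {"fever", "cough", "flu"})
rule_confidence = confidence(RECORDS, {"fever", "cough"}, {"flu"})
```

A rule is usually reported only when both measures exceed user-specified thresholds; mining at multiple levels of abstraction would repeat this computation after generalizing the items along a concept hierarchy.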



Efficient data mining methods for large databases have been studied
extensively, with many interesting methods developed. Examples include the
attribute-oriented induction method for mining characteristic and comparison
rules (Han & Fu 1996), the Apriori algorithm for mining association rules
(Agrawal & Srikant 1994), the C4.5 algorithm for classification and decision
tree construction (Quinlan 1993), and the CLARANS, DBScan, Birch, and k-means
algorithms for clustering analysis (Ng & Han 1994, Ester, Kriegel, Sander & Xu
1996, Zhang, Ramakrishnan & Livny 1996, Jain & Dubes 1988).



3 DESIRED OLAP MINING FUNCTIONS 

By integrating OLAP and data mining, OLAP mining facilitates flexible
mining of interesting knowledge in data cubes because data mining can be
performed in the multidimensional, multilevel abstraction space of a data
cube. Cubing and mining functions can be interleaved and integrated to make
data mining a highly interactive and interesting process.

We first examine the desired OLAP mining functions.

1. Cubing then mining: With the availability of data cubes and cubing opera- 
tions, mining can be performed on any layers and any portions of a data 
cube. This means that one can first perform cubing operations to select the 
portion of the data and set the abstraction layer (granularity level) before 
a data mining process starts. 

For example, one may first tailor a cube to a particular subset, such as
“year = 1997”, and to a desired level, such as the city level for the
dimension store, and then execute a prediction mining module.

2. Mining then cubing: This means that data mining can be first performed on 
a data cube, and then particular mining results can be analyzed further by 
cubing operations. 

For example, one may first perform classification on a “market” data cube
according to a particular dimension or measure, such as profit_made. Then
for each obtained class, such as the high_profit class, cubing operations can
be performed, e.g., drilling down to detailed levels to examine its
characteristics.

3. Cubing while mining: A flexible way to integrate mining and cubing
operations is to perform similar mining operations at multiple granularities by
initiating cubing operations during mining. By doing so, the same data 
mining operations can be performed on different portions of a cube or at 
different abstraction levels. 

For example, for mining association rules in a “market” data cube, one can 
drill down along a dimension, such as time, to find new association rules 
at a lower level of abstraction, such as from year to month. 

4. Backtracking: To facilitate interactive mining, one should allow a mining 
process to backtrack one or a few steps or backtrack to a preset marker 
and then explore alternative mining paths. 
For example, one may classify market data according to profit_made and
then drill down along some dimension(s), such as store, to see its
characteristics. Alternatively, one may like to classify the data according to
another measure, cost_of_product, and then do the same (characterization). This
requires the miner to jump back a few steps or backtrack to some previously
marked point, and redo the classification. Such flexible traversal along the
cube during mining is a highly desired feature for users.

5. Comparative mining: A flexible data miner should allow comparative data 
mining, that is, the comparison of alternative data mining processes. 

For example, a data miner may contain several cluster analysis algorithms. 
One may like to compare side by side the clustering quality of different 
algorithms, even examine them when performing cubing operations, such 
as when drilling down to detailed abstraction layers. 



It is possible to have other combinations in OLAP mining. For example,
one can perform “mining then mining”, such as first performing classification on
a set of data and then finding association patterns for each class.

In a large warehouse containing a huge amount of data, it is crucial to
provide flexibility in data mining so that a user may traverse a data cube,
select the mining space and the desired levels of abstraction, and test
different mining modules and alternative mining algorithms at his/her
fingertips. By doing so, mining will be a highly interactive, enjoyable, and
productive process.
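
One way to picture how cubing, mining, and backtracking might interleave is a session object that keeps a history of cube states. The sketch below is an assumption-laden illustration, not a description of DBMiner: the class `MiningSession`, its methods, the sample rows, and the stand-in mining function `total_profit` are all hypothetical.

```python
class MiningSession:
    """Keeps a stack of cube states so cubing steps can be undone."""

    def __init__(self, rows):
        self._history = [list(rows)]   # cube states, newest last
        self._markers = {}             # marker name -> history depth

    @property
    def cube(self):
        return self._history[-1]

    def apply(self, operation):
        """Apply a cubing operation (rows -> rows) and record the new state."""
        self._history.append(operation(self.cube))
        return self

    def mine(self, miner):
        """Run a mining function on the current cube without changing it."""
        return miner(self.cube)

    def mark(self, name):
        """Remember the current state under a name for later backtracking."""
        self._markers[name] = len(self._history)

    def backtrack(self, steps=1, to_marker=None):
        """Undo the last `steps` operations, or return to a named marker."""
        if to_marker is not None:
            depth = self._markers[to_marker]
        else:
            depth = len(self._history) - steps
        del self._history[max(depth, 1):]   # never drop the initial cube
        return self


ROWS = [
    {"year": 1996, "city": "Vancouver", "profit": 10},
    {"year": 1997, "city": "Vancouver", "profit": 40},
    {"year": 1997, "city": "Burnaby", "profit": 5},
]


def total_profit(rows):
    """Stand-in for a real mining module."""
    return sum(r["profit"] for r in rows)


session = MiningSession(ROWS)
session.mark("start")
session.apply(lambda rows: [r for r in rows if r["year"] == 1997])       # cubing
profit_1997 = session.mine(total_profit)                                 # then mining
session.apply(lambda rows: [r for r in rows if r["city"] == "Burnaby"])  # cubing again
profit_burnaby_1997 = session.mine(total_profit)
session.backtrack(to_marker="start")                                     # alternative path
profit_all = session.mine(total_profit)
```

Storing whole cube states is the simplest possible design; a practical system would more likely record the sequence of operations and recompute or cache cuboids on demand.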



4 EFFICIENT IMPLEMENTATION OF OLAP MINING 

Assuming readers have general knowledge of data warehousing (Chaudhuri & 
Dayal 1997) and data mining (Fayyad et al. 1996), we examine the methods 
for efficient implementation of OLAP mining in this section. 

With recent developments of data warehousing technology, data cubes can 
be computed and accessed efficiently. For example, one may use either a MO- 
LAP (based on multi-dimensional array structures) or a ROLAP (based on 
relational structures) approach for efficient cube storage and computation 
(Agarwal et al. 1996, Gray, Chaudhuri, Bosworth, Layman, Reichart,
Venkatrao, Pellow & Pirahesh 1997, Zhao et al. 1997), where a cube could be
dense or sparse. Also, data cubes can be indexed or bit-mapped in several ways
for efficient accessing (Chaudhuri & Dayal 1997, Sarawagi 1997). With these
technologies available, our discussion will focus on how OLAP mining should 
be performed in cooperation with data mining functions. 
4.1 OLAP-based characterization and comparison

Data characterization summarizes and characterizes a set of task-relevant data 
based on data generalization. For mining multiple-level knowledge, progressive 
deepening (drill-down) and progressive generalization (roll-up) techniques can
be applied. 

Progressive generalization starts with a conservative generalization process 
which first generalizes the data to a slightly higher level than the primitive 
data in the data cube. Further generalizations can be performed on it pro- 
gressively by selecting appropriate attributes for step-by-step generalization. 

Progressive deepening starts with a relatively high-level generalized cuboid, 
selectively and progressively specializes some of the generalized tuples or at- 
tributes to lower abstraction levels. 

Conceptually, a top-down, progressive deepening process is preferable since 
it is natural to first find general data characteristics at a high abstraction 
level and then follow certain interesting paths to drill down to specialized 
cases. However, from the implementation point of view, it is easier to perform
generalization than specialization because generalization replaces low-level
data with higher-level data through ascension of a concept hierarchy. Since a generalized
cell does not register the detailed original information, it is difficult to get such 
information back when specialization is required later. 

To facilitate specializations on a high-level cuboid, a typical technique is
to save a set of “lower-level cuboids”, especially the “minimally generalized
cuboid”, either at the preprocessing stage (i.e., cube computation stage) or
in the early stage of generalization. For example, to compute the minimally
generalized cuboid, each dimension in the relevant set of data can be
generalized to minimally generalized concepts (which can be done in one scan of
the database or warehouse), with the measures aggregated correspondingly.
After that, both progressive deepening and interactive up-and-down can be 
performed with reasonable efficiency: If the data at the current abstraction 
level is to be generalized further, generalization can be performed on it di- 
rectly; on the other hand, if it is to be specialized, the desired result can be 
derived by searching for the closest lower-level cuboid and generalizing such 
a cuboid to appropriate level(s) if necessary. 
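
The technique described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the DBMiner implementation: the primitive rows, the hierarchy mappings, and the helper `generalize` are hypothetical, and only two dimensions and one measure are used.

```python
from collections import defaultdict

# Hypothetical primitive rows: (store_id, hour, units_sold).
ROWS = [
    ("s1", "1997-08-01T09", 2),
    ("s1", "1997-08-01T10", 1),
    ("s2", "1997-08-02T09", 4),
    ("s3", "1997-09-01T09", 3),
]

# Assumed concept hierarchies, one mapping per generalization step.
STORE_TO_CITY = {"s1": "Vancouver", "s2": "Vancouver", "s3": "Victoria"}
CITY_TO_PROVINCE = {"Vancouver": "BC", "Victoria": "BC"}


def hour_to_day(hour):
    return hour[:10]      # "1997-08-01T09" -> "1997-08-01"


def day_to_month(day):
    return day[:7]        # "1997-08-01" -> "1997-08"


def generalize(cuboid, store_step, time_step):
    """Climb one level on each dimension and re-aggregate the measure."""
    result = defaultdict(int)
    for (store, time), units in cuboid.items():
        result[(store_step(store), time_step(time))] += units
    return dict(result)


# One scan of the primitive data builds the minimally generalized cuboid
# (here: city by day), which is saved for later specialization.
minimal = defaultdict(int)
for store_id, hour, units in ROWS:
    minimal[(STORE_TO_CITY[store_id], hour_to_day(hour))] += units
minimal = dict(minimal)

# Further generalization works directly on the cuboid at hand.
province_by_month = generalize(minimal, CITY_TO_PROVINCE.get, day_to_month)

# Specializing province/month to city/month cannot start from
# province_by_month; it is derived from the saved lower-level cuboid.
city_by_month = generalize(minimal, lambda city: city, day_to_month)
```

The key point is that `province_by_month` no longer contains city information, so the drill-down result is obtained by generalizing the saved lower-level cuboid to the requested level.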

An output of the characterizer is shown in Figure 3. 

Comparison finds a set of discriminant rules which distinguish the
general features of a target class from those of the contrasting class(es)
specified by a user. It is implemented as follows.

First, the set of relevant data in the database is collected by query
processing and partitioned into a target class and one or a set
of contrasting class(es). Second, attribute-oriented induction is performed on
the target class to extract a prime target cube, where a prime target cube is a
generalized cuboid in which each attribute contains a number of distinct
values no greater than, but close to, the threshold value of the corresponding
attribute. Then the concepts in the contrasting class(es) are generalized to
the same level as those in the prime target cube, forming the prime
contrasting cube. Finally, the information in these two classes is used to
generate qualitative or quantitative discriminant rules.

Figure 3 Graphical output of the Characterizer of DBMiner

Moreover, interactive drill-down and roll-up can be performed synchronously 
in both target class and contrasting class(es) in a similar way as in character- 
ization. 

How can multi-level characterization and comparison be integrated into 
OLAP mining? Since at each step of drill-down or roll-up, characterization 
and comparison produce a new cuboid, with the same data structure, it is 
inherently suitable for integration with OLAP mining. That is, any mining 
module can treat the result of characterization and comparison as a data cube 
and mining can be performed directly on such a resulting cube. Furthermore, 
for any mining result on which cubing operations can be performed, charac- 
terization and comparison can be done as well. 



4.2 OLAP-based association 

Based on many studies on efficient mining of association rules (Agrawal &
Srikant 1994, Srikant & Agrawal 1995, Han & Fu 1995), a multiple-level
association rule miner (called “associator”) has been implemented in DBMiner.
An output of the associator is shown in Figure 4. 

Figure 4 Graphical output of the Associator of DBMiner

Unlike mining association rules in transaction databases, a relational
associator may find two kinds of associations: inter-attribute association and
intra-attribute association. The former is an association among different
attributes, whereas the latter is an association within one or a set of
attributes formed by grouping of another set of attributes. This is illustrated
in the following example.

Example 2. Suppose the “course_taken” relation in a university database
has the following schema:

course_taken = (student_id, course, semester, grade).



Intra-attribute association is the association among one or a set of attributes
formed by grouping another set of attributes in a relation. For example, the
association between each student and his/her course performance is an
intra-attribute association because a set of attributes, “course, semester,
grade”, is grouped according to student_id, for mining associations among the
courses taken by each student. From a relational database point of view, a
relation so formed is a nested relation obtained by nesting “(course, semester,
grade)” with the same student_id. Therefore, an intra-attribute association is
an association among the nested items in a nested relation.

Inter-attribute association is the association among a set of attributes in
a flat relation. For example, the association between course and grade, such
as “the courses in computing science tend to give good grades”, is an
inter-attribute association.

The two kinds of association require different mining algorithms.

For mining intra-attribute associations, a data relation can be transformed
into a nested relation in which the tuples which share the same values in the
nesting attributes are merged into one. For example, the course_taken relation
can be folded into a nested relation with the schema,



course_taken = (student_id, course_history)
course_history = (course, semester, grade).



With such a transformation, it is easy to derive association rules like “90%
of senior CS students tend to take at least three CS courses at 300-level or up
in each semester”. Since the nested tuples (or values) can be viewed as data
items in the same transaction, the methods for mining association rules in
transaction databases, such as Apriori (Agrawal & Srikant 1994), can be applied
to such transformed relations in relational databases. □
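
As a rough illustration of Example 2, the sketch below folds a flat course_taken relation into per-student transactions and counts frequent course pairs. The tuples and the min_support value are invented for illustration, and the pair-counting pass is a simplification that stands in for a full Apriori implementation; it is not the associator's actual code.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical tuples of course_taken = (student_id, course, semester, grade).
course_taken = [
    (1, "CS310", "1997F", "A"),
    (1, "CS320", "1997F", "B"),
    (1, "MATH200", "1997F", "B"),
    (2, "CS310", "1997F", "B"),
    (2, "CS320", "1997F", "A"),
    (3, "CS310", "1997F", "C"),
]


def nest_by_student(rows):
    """Fold the flat relation into student_id -> set of courses (one transaction each)."""
    nested = defaultdict(set)
    for student_id, course, _semester, _grade in rows:
        nested[student_id].add(course)
    return dict(nested)


def frequent_pairs(transactions, min_support):
    """Return course pairs whose support (fraction of students) is >= min_support."""
    counts = defaultdict(int)
    for items in transactions.values():
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}


transactions = nest_by_student(course_taken)
pairs = frequent_pairs(transactions, min_support=0.5)
```

Once the relation is nested, each student plays the role of a transaction and each course the role of an item, which is why transaction-oriented methods carry over.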



Moreover, it is preferable to have some user-specified constraints to guide an
association rule mining process. Such constraints can be specified in a
meta-rule (or meta-pattern) form (Kamber, Han & Chiang 1997), which confines the
search to specific forms of rules. For example, a meta-rule “P(x, y) → Q(x, y, z)”,
where P and Q are predicate variables matching different properties in a
database, can be used as a rule-form constraint in the search.

The multi-dimensional data cube structure facilitates efficient mining of 
multi-level, inter-attribute association rules. A count cell of a cube stores the 
number of occurrences of the corresponding multi-dimensional data values; 
whereas a dimension count cell stores the sum of counts of the whole dimen- 
sion. With this structure, it is straightforward to calculate the measures such 
as support and confidence of association rules based on the values in these 
summary cells. A set of such cuboids, ranging from the least generalized one
to rather high level ones, facilitates mining of association rules at multiple
levels of abstraction.
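
A small sketch may make the calculation concrete. The cuboid below is hypothetical (two dimensions, invented counts), and the dimension count is recomputed on the fly here, whereas the text describes it as stored in a summary cell.

```python
# Hypothetical count cells of a 2-D cuboid over (course_area, grade_level).
count = {
    ("computing", "good"): 40,
    ("computing", "poor"): 10,
    ("history", "good"): 20,
    ("history", "poor"): 30,
}


def dimension_count(cells, dim, value):
    """Sum of counts over all cells whose dimension `dim` equals `value`."""
    return sum(c for key, c in cells.items() if key[dim] == value)


def support_and_confidence(cells, antecedent, consequent):
    """Measures for the rule (dimension 0 = antecedent) -> (dimension 1 = consequent)."""
    total = sum(cells.values())
    both = cells.get((antecedent, consequent), 0)
    support = both / total
    confidence = both / dimension_count(cells, 0, antecedent)
    return support, confidence


support, confidence = support_and_confidence(count, "computing", "good")
```

With these invented counts, the rule "computing → good" is supported by 40 of 100 tuples and holds for 40 of the 50 computing tuples.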

The association mining module generates a set of association rules. Since the
rule form is quite different from a data cube structure, it is not easy to
integrate the mining results with other mining/cubing processes. One choice
is to take any rule which contains a few connected nodes as a cuboid, from 
which characteristics can be displayed and drill-down or roll-up can be per- 
formed. Another choice is to take a node as a one-dimensional cuboid and 
show data distribution and add additional attributes for cubing and mining. 
The third choice is to simply backtrack to a point where cubing and other 
mining operations can take place naturally.

Figure 5 Graphical output of the Classifier of DBMiner



4.3 OLAP-based classification 

Data classification is to develop a description or model for each class in a 
database, based on the features present in a set of class-labeled training data. 

There have been many classification methods studied, including decision- 
tree methods, such as ID-3 and C4.5 (Quinlan 1993), statistical methods, 
neural networks, rough sets, etc. Recently, some database-oriented classifica- 
tion methods have also been investigated (Mehta, Agrawal & Rissanen 1996). 

Our classification method consists of four steps: (1) collection of the relevant 
set of data and partitioning of the data into training and test data, (2) analysis 
of the relevance of the attributes, (3) construction of classification (decision) 
tree, and (4) test of the effectiveness of the classification using the test data 
set. 

Attribute relevance analysis is performed based on the analysis of an
uncertainty measurement, a measurement which determines how relevant an
attribute is to the class attribute. Other measurements, such as
entropy-based information gain (Quinlan 1993) and Gini index (Mehta et al.
1996), can be used for relevance analysis as well. Several top-most relevant
attributes are retained for classification analysis, whereas the weakly
relevant or irrelevant attributes are not considered in the subsequent
classification process.
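
The text does not give the formula of the uncertainty measurement, so the sketch below uses entropy-based information gain, one of the alternatives it names, to rank attributes and keep the top k. The attribute names and training rows are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, attr, class_attr):
    """Reduction in class entropy obtained by splitting `rows` on `attr`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attr]].append(row[class_attr])
    n = len(rows)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy([row[class_attr] for row in rows]) - remainder


def top_relevant(rows, attrs, class_attr, k):
    """Keep the k attributes with the highest information gain."""
    ranked = sorted(
        attrs, key=lambda a: information_gain(rows, a, class_attr), reverse=True
    )
    return ranked[:k]


# Hypothetical training data; "honours" is the class attribute.
rows = [
    {"major": "CS", "year": "senior", "honours": "yes"},
    {"major": "CS", "year": "junior", "honours": "yes"},
    {"major": "History", "year": "senior", "honours": "no"},
    {"major": "History", "year": "junior", "honours": "no"},
]
ranked = top_relevant(rows, ["year", "major"], "honours", k=1)
```

In this toy data "major" separates the classes perfectly while "year" carries no information, so only "major" would be retained.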

In the classification process, our classifier adopts a generalization-based 
decision-tree induction method which integrates OLAP data cube technology 
with a decision-tree induction technique, by first performing minimal general- 
ization on the set of training data, and then performing decision tree induction 
on the generalized data. 

Since a generalized cell comes from the generalization of a number of
original cells, the count information is associated with each generalized cell
and plays an important role in classification. To handle noise and exceptional
data and facilitate statistical analysis, two thresholds, classification threshold
and exception threshold, are introduced. The former is used to decide
whether classification needs to continue on a node when a significant set
of the examples of the node belongs to a single class, whereas the latter is
used to terminate further classification on a node if the node contains only a
negligible number of examples.
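
One way to read the two thresholds is as stopping tests applied to each node of the decision tree. The sketch below assumes both thresholds are expressed as fractions (of the node for the classification threshold, of the whole training set for the exception threshold); the text does not fix these definitions, so the function and its parameters are illustrative only.

```python
def should_stop(class_counts, total_examples,
                classification_threshold, exception_threshold):
    """Decide whether to stop splitting a decision-tree node.

    class_counts: dict mapping class label -> count of examples at the node
                  (counts carried by the generalized cells).
    total_examples: number of examples in the whole training set.
    Returns a pair (stop, reason).
    """
    node_size = sum(class_counts.values())
    # Exception threshold: the node holds only a negligible share of the data.
    if node_size == 0 or node_size / total_examples < exception_threshold:
        return True, "exception"
    # Classification threshold: one class already dominates the node.
    majority = max(class_counts.values())
    if majority / node_size >= classification_threshold:
        return True, "classified"
    return False, "split"
```

For instance, with thresholds of 0.9 and 0.01, a node holding 95 "yes" and 5 "no" examples out of 1000 would be accepted as a "yes" leaf, while a node with only 5 examples would be dropped as an exception.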

There are several alternatives for doing generalization before classification: 
A data set can be generalized to either a minimally generalized abstraction 
level, an intermediate abstraction level, or a rather high abstraction level. Too 
low an abstraction level may result in scattered classes, bushy classification 
trees, and difficulty in concise semantic interpretation; whereas too high a
level may result in the loss of classification accuracy. 

For OLAP mining, classification can be associated with other cubing and 
mining functions as follows. For any cubing result, one attribute can be se- 
lected as class attribute and classification can be performed on the correspond- 
ing cuboid in the same way as our cube-based classification process. For any 
classification result, each class node can be treated as a portion of the cube 
selected by the class constraint. Subsequent cubing and mining operations can 
be performed on the selected class. 

The multi-level classification process has been implemented in the DBMiner 
system. An output of the DBMiner classifier is shown in Figure 5. 



4.4 OLAP-based prediction 

A predictor predicts data values or value distributions on the attributes of 
interest based on similar groups of data in the database. For example, one 
may predict the amount of research grants that an applicant may receive 
based on the data about the similar groups of researchers. 

The power of data prediction should be confined to the ranges of numerical 
data or the nominal data generalizable to only a small number of categories. 
It is unlikely to give reasonable prediction on one's name or social insurance 
number based on other persons’ data. 

For successful prediction, the factors (or attributes) which strongly
influence the values of the attributes of interest should be identified first.
This can be done by the analysis of data relevance or correlations by
statistical methods, decision-tree classification techniques, or be simply
based on expert judgement. To analyze attribute relevance, an uncertainty
measurement similar to the method used in our classifier is applied. This
process ranks the relevance of all the attributes selected and only the highly
ranked attributes will be used in the prediction process.

Figure 6 Graphical output of the Predictor of DBMiner: numeric predictive
attribute (left) and categorical predictive attribute (right)

After the selection of highly relevant attributes, a generalized linear model
is constructed which can be used to predict the value or value
distribution of the predicted attribute. If the predictive attribute is
numerical, a set of curves is generated, each indicating the trend of likely
changes of the value distribution of the predicted attribute. If the predictive
attribute is categorical, a set of pie charts is generated, each indicating the
distribution of the value ranges of the predicted attribute.

When a query probe is submitted, the corresponding value distribution 
of the predicted attribute can be plotted based on the curves or pie charts 
generated above. Therefore, the values in the set of highly relevant predictive
attributes can be used for trustworthy prediction.

The prediction output has two forms of presentation, curve graph and pie
chart, depending on whether the predictive attribute is a numeric attribute or
a categorical attribute. When the predictive attribute is a numeric one, the 
output is a set of curves as shown in the left half of Figure 6; whereas when 
the predictive attribute is a categorical one, the output is a set of pie charts 
as shown in the right half of Figure 6. 

OLAP mining can be integrated with the prediction module as follows. For 
any predicted class, the class can be identified by a class selection criterion and
its characteristics can be displayed. Then cubing operations can be performed 
on such a selected cuboid. Alternatively, one may backtrack to a point before 
prediction is performed and continue the exploration of other features of the 
previously selected data cube.



4.5 OLAP-based clustering analysis

Data clustering, also viewed as “unsupervised learning” , is a process of par- 
titioning a set of data into a set of classes, called clusters, with the members 
of each cluster sharing some interesting common properties. A good cluster- 
ing method will produce high quality clusters, in which the intra-class (i.e., 
intra-cluster) similarity is high and inter-class similarity is low. 

Clustering has many interesting applications. For example, it can be used 
to help marketers discover distinct groups in their customer bases and develop 
targeted marketing programs. 

Data clustering has been studied in statistics, machine learning and data 
mining with different methods and emphases. Many clustering methods have 
been developed and applied to various domains, such as data classification 
and image processing. 

Data mining applications deal with large high dimensional data, and fre- 
quently involve categorical domains with concept hierarchies. However, most 
of the existing data clustering methods can only handle numeric data, or can- 
not produce good quality results in the case where categorical domains are 
present. 

Our cluster analyzer is based on the well-known k-means paradigm. Compared
with other clustering methods, the k-means based methods are promising
for their efficiency in processing large data sets. However, their use is often
limited to numeric data. To adequately reflect categorical domains, we have
developed a method of encoding concept hierarchies. This enables us to
define a dissimilarity measure that not only takes into account both numeric
and categorical attributes, but also does so at multiple levels. Due to these
modifications, our cluster analyzer can cluster large data sets with mixed numeric
and categorical attributes in a way similar to k-means. It can also perform 
multi-level clustering and select a desired level by a comparison of the cluster- 
ing quality at different levels. On the other hand, the user or the analyst can 
direct the clustering process by either selecting a set of relevant attributes for 
the requested clustering query, or assigning a weight factor to each attribute, 
or both, so that increasing the weight of an attribute increases the likelihood 
that the algorithm will cluster according to that attribute. 

OLAP mining can be integrated with the cluster analyzer as follows. For any
cluster so obtained, its characteristics can be displayed and cubing/mining 
operations can be performed on such a selected cluster. Alternatively, one 
may backtrack to a point before clustering is performed and continue the 
exploration of other features of the previously selected data cube.



4.6 Backtracking and comparative mining analysis

Backtracking is convenient for OLAP mining since a user may wish to
tentatively dig deep along some mining paths and later try alternatives if
no interesting patterns have been found.

One suggested technique for implementation of backtracking in OLAP
mining is as follows. First, a status vector should be saved in a backtrack stack
(if the backtrack pattern is simply tracing back step by step) or a backtrack
list (if position marking or other traversal patterns are desired). The cuboid(s)
associated with the vector should also be saved on the disk and linked with 
the vector. When backtracking, such a stack/list is used to trace back the ap- 
propriate status pointers. When a session is complete, all the saved backtrack 
pointers, and their associated vectors and cuboids should be deleted to release 
the disk space occupied. 
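
The step-by-step variant of this technique can be sketched as follows. The class name, the use of pickle files in a temporary directory, and the shape of the status vector are all assumptions made for illustration; the text prescribes only that vectors are stacked, cuboids are saved on disk and linked to them, and everything is deleted at the end of the session.

```python
import os
import pickle
import tempfile


class BacktrackStack:
    """Step-by-step backtracking: each entry links a status vector to a cuboid on disk."""

    def __init__(self):
        self._dir = tempfile.mkdtemp(prefix="olap_backtrack_")
        self._stack = []  # list of (status_vector, path_to_cuboid_file)

    def push(self, status_vector, cuboid):
        """Save the cuboid on disk and remember it together with its status vector."""
        path = os.path.join(self._dir, f"cuboid_{len(self._stack)}.pkl")
        with open(path, "wb") as f:
            pickle.dump(cuboid, f)
        self._stack.append((status_vector, path))

    def backtrack(self):
        """Pop the most recent state, reload its cuboid, and delete its file."""
        status_vector, path = self._stack.pop()
        with open(path, "rb") as f:
            cuboid = pickle.load(f)
        os.remove(path)
        return status_vector, cuboid

    def close(self):
        """End of session: delete all remaining saved cuboids and the directory."""
        for _status, path in self._stack:
            os.remove(path)
        self._stack.clear()
        os.rmdir(self._dir)
```

A backtrack list supporting position marking could keep the same entries in a plain list and address them by index instead of popping from the end.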

Comparative mining analysis can be implemented similarly. For compar- 
ative analysis of two mining tasks, one needs to fork a new path of min- 
ing/cubing process stream and show both streams in separate windows. Com- 
parative displays can also be synchronized when necessary by bundling to- 
gether the two similar mining/cubing processes. 



5 DISCUSSION AND CONCLUSIONS 

OLAP mining integrates on-line analytical processing with data mining, which
substantially enhances the power and flexibility of data mining and makes
mining an interesting exploratory process. 

In this paper, we discussed the principles and some implementation tech- 
niques of OLAP mining by taking the OLAP mining functions of the DBMiner 
system as running examples. 

With multiple data mining functions available, one question which natu- 
rally arises is how to determine which data mining function is the most ap- 
propriate one for a specific application. To select an appropriate data mining 
function, one needs to be familiar with the application problem, data char- 
acteristics, and the roles of different data mining functions. Sometimes one 
needs to perform interactive exploratory analysis to observe which function 
discloses the most interesting features in the database. Therefore, the building 
of exploratory analysis tools and the construction of an application-oriented 
semantic layer are two important solutions. OLAP mining provides an
exploratory analysis tool; however, further study should be performed on the
automatic selection of data mining functions for particular applications.

Another frequently posed question is how these data mining techniques
differ from the set of existing statistical data analysis tools. Data
mining is the confluence of multiple disciplines, including database systems,
data warehouses, statistics, visualization, machine learning, and information
science. Previous work on statistics has provided some foundational work for
data mining. Most effective data mining systems are extending the power of
database systems or data warehouse systems and re-examining the methods
studied in statistics, visualization, and machine learning. Many methods
developed in data mining systems are novel and scalable, and integrate well
with existing database systems, which makes them quite different from existing
statistical analysis tools and machine learning packages. Thus, data mining
forms a new direction in the research into the methods for the analysis of data
and knowledge in large databases and data warehouses.

We are currently working on the further enhancement of the power and 
efficiency of OLAP mining of DBMiner for exploratory data mining, including 
the improvement of system performance and rule discovery quality for the 
existing functional modules, and the development of techniques for mining 
new kinds of rules, especially on time-related data, and visual data mining. 

Another important task for future study is the extension of OLAP min- 
ing techniques towards advanced and/or special purpose database systems, 
including extended-relational, object-oriented, text, spatial, temporal, multi- 
media, and heterogeneous databases and Internet information systems. We 
will report the progress on OLAP mining of complex types of structured, 
semi-structured and nonstructured data in the future.



Abstract

In order to pose effective queries to Web sites, some form of site data model 
must be implicitly or explicitly shared by users. Many approaches try to com- 
pensate for the lack of such a common model by considering the hypertextual 
structure of Web sites; unfortunately, this structure has usually little to do 
with data semantics. In this paper a different technique is proposed that al- 
lows for both navigational and logical/conceptual description of Web sites. 
The data model is based on WG-log, a query language based on the graph- 
oriented database model of GOOD (Gyssens et al. 1997) and G-log (Paredaens 
et al. 1995), which allows the description of data manipulation primitives 
via (sets of) graph(s). The WG-log description of a Web site schema is lexi- 
cally based on standard hypermedia design languages, thus allowing for easy 
schema generation by current hypermedia authoring environments. The use 
of WG-log for queries allows graphic query construction with respect to both 
the navigational and the logical parts of schemata. Site schemata are man- 
aged by Schema Robots, which assist clients in the process of identification 
and retrieval of a set of candidate schemata. On the basis of the set of can- 
didate schemata, the client may then query individual Web sites; extensive 
data caching is used to avoid flooding resulting from an excessive number 
of candidates. A remote Query Manager process, running side by side with 
standard Web servers, manages query execution and handles the presentation 
of the results to the client. Our schema is particularly suited for Intranets, 
while allowing for a smooth migration of Internet Web sites as more and 
more of them are produced on the basis of hypermedia design and generation 
methodologies. 



Keywords 

World Wide Web, query languages, schema-based semantics of Web sites 

Data Mining and Reverse Engineering S. Spaccapietra & F. Maryanski (Eds.)

© 1998 IFIP. Published by Chapman & Hall 







1 INTRODUCTION 

The steady growth in the amount of data published via the World Wide 
Web (WWW) has led to a number of attempts to index the Web’s contents.
Initially, these attempts only tried to collect and index title-like information 
about every reachable page of data on the WWW, and then build Boolean 
keyword searches into that database. Today, one could hardly find a Web 
search engine relying on term indexing alone. However, although many of the 
new search engines present a sophisticated query interface, the results they 
deliver are perceived by the user community as unsatisfactory. In our opinion, 
this situation is mainly due to the following three causes: 



• Noise effect 

In the current systems, query language interfaces suggest precise semantics 
while the underlying keyword search mechanism remains mainly syntactical 
in nature and therefore prone to the well-known noise and silence side- 
effects. Sophisticated indexing techniques exploiting HTML tagging may 
relieve this problem, but by no means solve it completely. 

• Flatness of results 

Currently, query results are flat lists of pages that do not capture the
underlying structure of the searched Web sites. As a consequence, retrieved
information can neither be easily presented to the user nor be efficiently
reorganized.

• Non-HTML objects indexing

The increasing presence of non-HTML objects (Hu et al. 1996) attached to
Web pages (such as multimedia clips or Java applets) jeopardizes automatic
index construction and update.



The above considerations suggest complementing keyword-based searching
with database-style support for querying the Web. Several research projects
have addressed this problem; they are reviewed in Section 2.



1.1 An outline of our approach 

Unlike the others, our approach is based on the following two principles:



• The availability of a schema is a prerequisite for the development of an
effective query mechanism, since the schema carries most of the semantic
information needed for querying.

• If a well specified methodology is used to design a Web site, some notion
of schema is present during the site design process.




Structuring and querying Web sites 






[Figure 1 boxes: “Logical Description Languages (e.g. G-Log, GOOD)”; “Mixed
logical/navigational description languages (automatic publishing tools, e.g.
MS Frontpage)”; “Real Web Pages (static or dynamic)”]

Figure 1 Links between conceptual and hypermedia description languages



So, this paper proposes to reuse site design artefacts to attain (semi)automatic
construction of schemata for Web sites. As outlined in Fig. 1, many popular
authoring environments for Web sites already hint at the idea of some sort
of navigational schema to be chosen by the user as the basis of automatic
site generation. Moreover, several research prototypes of Web site generators
(Fraternali et al. 1997) based on hypermedia design languages such as HDM
(Hypermedia Design Method) (Garzotto et al. 1995), YOO (Balasubramanian
et al. 1995), RMM (Isakowitz et al. 1995) and the like, are becoming available.
This suggests the possibility of semi or fully automatic generation of schemata 
during the design process, allowing for smooth migration to a more organized 
Web as more and more sites are produced on the basis of standard design 
and generation methodologies. In our approach, Web site schemata form a 
distributed hierarchy, managed by server processes called Schema Robots. A 
Schema Robot is a server process which maintains and provides information
about Web site schemata, stored and presented in WG-log form, to the clients
in execution over the Net. Although in this paper we will not elaborate on
site classification issues, it is worthwhile to remark that Schema Robots need 
not be all equal; for instance, a set of domain schema robots might form a 
distributed partitioned repository of all known schemata on the Net, while 
category schema robots might allow for a subject-related search. In this per- 
spective, Robots can also be regarded as browsable similarity-based hypertexts 
of WG-log schemata, providing semantically rich information about Web sites. 
This hypertextual structure should not be regarded simply as “another tech- 
nique” for schema identification; the availability of large, searchable schema 
repositories may prove a significant contribution to the much needed Web 
restructuring via resource indexing systems (Budi et al. 1996).

All queries are formulated w.r.t. a site schema, supposed to be known in
advance to the client formulating the query. In order to formulate and execute
the query, the following steps are performed:

1. Schema identification 

2. Schema retrieval and query formulation 

3. Instance retrieval 

4. Presentation of results 

In the following, we shall describe each step in some detail. 

• Schema identification 

To get the best results, queries to the Web must be formulated on the 
basis of a known schema. In order to help the user in identifying a suit- 
able schema w.r.t. the planned query, facilities for schema browsing and 
keyword-searching, together with a Thesaurus are available at Robot sites. 
The Thesaurus is mainly intended to provide a browsable controlled
vocabulary to help the user in getting acquainted with the schema data
dictionary. Note that, as the number of stored schemata grows, automatic
Thesaurus initialization and update (Damiani et al. 1995) will become
crucial for the Robots’ performance.

• Schema retrieval 

With the help of the keyword-based information provided by the user, the 
Schema Robot’s search engine identifies candidate schemata to be proposed 
to the user. With the help of a graphical editor, the user constructs
appropriate queries for the schemata identified by the Robot; then, on
reception of the schema-based user queries, the Robot consults an instance
cache to discard sites that surely do not contain the desired information.
This cache holds, for each schema, the values of selected attributes of some 
entities; aging mechanisms are required to ensure its consistency. The result 
of this second step is a list of Web sites together with valid references to 
them, i.e. the network addresses of the associated query manager processes.

Figure 2 System architecture



• Instance retrieval 

The query is now broadcast to all query managers running on candidate 
Web sites. If the candidate Web site is based on an underlying database, 
the Query Manager process can provide an interface to the existing DBMS. 
In general, however, the Query Manager will maintain a copy of the
WG-log schema and some indexing information linking navigational entities to
(a list of) URLs. The resulting instance is computed at the target site;
efficiency is critical at this moment, both in terms of time of computation
of the resulting instance and of amount of gathered data that has to be
transmitted to the originating site.

• Presentation of Results 

This last step performs gathering and presentation of the result pages, on 
the basis of the data structures mapping HTML pages and the navigational 
part of the schema. This is a very interesting problem, since the amount of 
information contained in the resulting instance may be huge and the user 
must be presented with a synthesis of the available information, organized 
in such a way that he/she can choose to browse the resulting instance 
according to different perspectives. The synthetical result submitted to the 
user must enable him/her to formulate one or more appropriate goals that 
select exactly the needed information from the computed instance, to be 
transported to the originating site. 
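
The instance cache consulted during schema retrieval can be pictured with the following sketch. It assumes, for illustration only, that a cache entry lists all values of one attribute of one entity at one site, and that aging is a simple maximum age in seconds; the class and function names are hypothetical and the paper does not prescribe this design.

```python
import time


class InstanceCache:
    """Per-schema cache of selected attribute values, with time-based aging."""

    def __init__(self, max_age_seconds):
        self.max_age = max_age_seconds
        # (site, entity, attribute) -> (set of known values, time of last refresh)
        self._entries = {}

    def store(self, site, entity, attribute, values, now=None):
        now = time.time() if now is None else now
        self._entries[(site, entity, attribute)] = (set(values), now)

    def surely_lacks(self, site, entity, attribute, wanted, now=None):
        """True only if a fresh entry exists and does not contain `wanted`."""
        now = time.time() if now is None else now
        entry = self._entries.get((site, entity, attribute))
        if entry is None:
            return False  # nothing cached: the site cannot be ruled out
        values, stamp = entry
        if now - stamp > self.max_age:
            return False  # stale entry: treat its content as unknown
        return wanted not in values


def candidate_sites(cache, sites, entity, attribute, wanted, now=None):
    """Keep every site that the cache cannot exclude."""
    return [
        s for s in sites
        if not cache.surely_lacks(s, entity, attribute, wanted, now)
    ]
```

The sketch errs on the side of keeping a site whenever its cache entry is missing or too old, so that aging can only enlarge the candidate list and never hide a site that may hold the desired data.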



Fig. 2 summarizes the system architecture. This paper is organized as follows:
Section 2 presents a survey of existing techniques, Sections 3 and 4 describe
the data model and language of WG-log; Section 5 presents the naive computation
of WG-log queries, while Section 6 draws the conclusions and outlines future
research on the same subject.

Figure 3 A semantics-based classification of research approaches to WWW
querying



2 RELATED WORK 

The huge amount of data published via the World Wide Web has led to a
number of research efforts on techniques to index, query and restructure the
contents of WWW sites. In this section we provide a brief overview of related work (see
also (Torlone 1996)). Our discussion of previous work is based on how the 
various approaches deal with semantics representation. Figure 3 summarizes 
the overview. 

• Free text indexing - No semantics representation

Early approaches to Web indexing tried to collect and index title-like
information about every reachable page of data on the WWW and then build
Boolean keyword searches into the resulting document. The JumpStation
(Fletcher 1990), probably the first well-known system designed to index WWW
information, collected only information marked by HTML <TITLE>-tags from
pages it encountered. Many current Web search engines are still partially
based on this approach, where search results are flat lists of HTML pages,
completely unrelated to the hypertextual structure of the sites they come
from. In the past few years, as the amount of WWW data continued to increase,
users grew dissatisfied with pure keyword matching. Nowadays, one could hardly
find a Web search engine relying on term indexing alone. Some keyword-based
indexes, like the World Wide Web Worm (WWWW) (McBryan et al. 1994), the
WebCrawler (Pinkerton 1997) and Lycos (Lycos 1997), try to complement keyword
indexing by taking into account the HTML document structure in order to make
educated guesses about semantics. For example, Lycos summarizes the actual
content of documents by taking advantage of human-tagged information (HTML
headings), of often-appearing keywords and of the introductory text that is
generally positioned at the beginning of a file.

• Semantics representation via taxonomies

Several search engines do not use keyword indexing but exploit a taxonomy
representing sites’ content. The popular Yahoo (Yahoo! 1997) search engine
relies on a broad hierarchical classification system of subjects, much
like those used by the Library of Congress or by the ACM Classification
of Computer Science Topics. Yahoo’s success has spawned multiple similar
tools, all based on the idea of providing large, monolithic servers holding
indexes of site contents (Point (Lycos-2 1997), Magellan (McKinley 1997),
and others). Recently, evolvable taxonomies were proposed, such as the
one developed by the CommerceNet Consortium (Hamilton 1997). In order to
exploit taxonomy classification together with free text searching,
obtaining the power of a full-fledged text-retrieval system, meta-search
services (MetaCrawler (Selberg et al. 1995), WebCompass (Quarterdeck 1997),
SavvySearch (Dreilinger 1997) and others) have been built that are able
to use powerful free-text indexes like Altavista (Altavista 1997) as
subroutines, querying all available services in parallel and then
aggregating the results. Although these search engines present a
sophisticated query interface, the results they deliver are currently
perceived by the user community as unsatisfactory.

• Structural representation of sites

A considerable amount of research has been done on how to complement
keyword-based searching with database-style support for querying the Web.
Several projects addressed this problem, and three main WWW query
languages have been proposed so far: W3QL (Konopnicki et al 1995), WebSQL
(Mendelzon et al 1996) and WebLog (Lakshmanan et al 1996). The first two
languages are modelled after the standard SQL used for RDBMS, while the
third retains the flavour of the Datalog language. However, all three
languages give only a small fraction of the power of the original query
languages they are based on, since they explicitly refrain from semantics
representation issues. W3QL and WebSQL offer a standard relational
representation of Web pages, such as
Document(url, title, text, type, length), which can be easily constructed
from HTML tagging. The user can present SQL-like queries to Web sites
based on that relational representation. Content-related queries (for
instance: Document.text = Italy) are mapped into free-text searches using
a conventional search engine. In addition to similar query capabilities,
WebSQL offers an elementary graph pattern search facility, allowing users
to search for simple paths in the graph representing the navigational
structure of a Web site. Finally, WebLog (and its enhancements (Giannotti
et al 1997)) proposes an O-O instance representation technique which leads
to a powerful deductive query language, fully equipped with recursion; but
again it lacks any representation of data semantics.

• Instance-based semantics representation

Nowadays it is widely recognized that to effectively build Web-based
services, developers must be able to impose some sort of semantic
structure upon Web sites in order to support efficient information capture
(Hamilton 1997). In fact, the subject of semantics representation for Web
sites is currently being actively investigated. A well-known technique for
instance-based semantics representation is semantic tagging, i.e. the use
of extended HTML tags to represent semantic information. The basic idea
underlying this approach is that a new kind of HTML tag can be used to
superimpose a representation of semantics (based, for instance, on the
standard entity-relationship technique) on the navigational structure of a
Web site. Semantic tags can be used to associate the data stored in a Web
page with an entity and to denote relationships as semantic links that are
not meant only to be followed in navigation, but also to be used for
querying purposes. Several variations of the semantic tagging idea (Kogan
et al 1997) have been proposed by various researchers (a logic-programming
approach is presented in (Loke et al 1997)). Moreover, HTML standard
committees (W3C 1997) seem to be considering its partial endorsement.
However, no effective query support based on semantic tagging is yet
available. Other approaches try to address the problem of Web indexing and
querying in the more general framework of dealing with semi-structured
data. For instance, the Tsimmis system (Garcia-Molina et al 1997) proposes
an OEM object model to represent semistructured information together with
a powerful query language, Lorel. For each Web site, the user defines OEM
classes to be used in its Tsimmis representation. Then, an extraction
technique based on a textual filter is applied, initializing objects from
Web page data. In addition, the Tsimmis DataGuide facility makes it
possible to identify regularities in the extracted instance representation
and so to produce a full-fledged site schema. We are currently exploring
the translation of DataGuide schemata into WG-log in order to add query
capability to Tsimmis sites.

• Schema-based semantics representation

With the partial exception of Tsimmis, all the approaches described above
lack an explicit notion of schema. This may be due to the fact that, while
the advantages of schema-aware query formulation are widely recognized in
the database context, this technique has been considered unfeasible on the
WWW because no schema information is normally associated with Web sites.
However, this situation is evolving as an increasing number of sites,
particularly on Intranets, are being designed using well specified design
methodologies such as HDM (Garzotto et al 1995), RMM (Isakowitz et al 1995),
YOO (Balasubramanian et al 1995) and the like. When a methodology is
used to design a Web site, some notion of site schema is present during the
site design process. Indeed, many commercial authoring environments for
Web sites already hint at the idea of some sort of navigational schema to
be chosen by the user as the basis of automatic site generation. Moreover,
several research prototypes of Web site generators based on hypermedia
design languages are becoming available. Some of these tools even translate
the site schema into a relational representation (Fraternali et al 1997). A
representation of semantics based on a standard relational schema is also
used in the Araneus project (Atzeni et al 1997), where Web site crawling
is employed to induce schemata of Web pages. These fine-grained page
schemata are later combined into a site-wide schema, and a special-purpose
language, Ulixes, is used to build relational views over it. The resulting
relational views can be queried using standard SQL, or transformed into
autonomous Web sites using a second special-purpose language, Penelope.
It is worth observing that the Araneus approach to schema induction
requires semi-structured Web site data to be converted into relational
tables to allow database-style querying. In WG-log, graph-based instance
and schema representations are used for querying, while Web site data
remain in their original, semi-structured form.



3 A GRAPH DESCRIPTION LANGUAGE FOR WEB SITES: 
THE DATA MODEL 

Design methodologies for hypermedia applications are currently well
established and widely employed to develop multimedia hypertextual
applications. Besides allowing conventional or object-oriented design
elements, such as E/R-like entities or OMT-like classes, nearly all modern
hypermedia specification languages are associated with a presentation and
navigation semantics, clearly indicating how entities are to be presented
to the user. This approach is justified by the fact that no query support
is generally offered for hypermedia products, whose use is based on free
user navigation. In our opinion, an effective Description and Manipulation
Language for Web sites should be able to complement the hypermedia model
with database-like querying capabilities. In this Section we describe
WG-log, a graph-oriented language supporting the representation of both
data model and structural entities. Graphs have long been an integral part
of the database design process, even more so since the introduction of
object-oriented data models; moreover, they are naturally connected to
graphical interfaces. WG-log has its formal basis in the graph-oriented
language G-log (Paredaens et al 1995). G-log, being designed as an object
database language, only includes two node types (representing abstract and
concrete objects) and one link type (representing logical relationships,
generally aggregations); WG-log extends G-log by including some standard
hypermedia design notations that allow for linking conceptual entities and
relationships to navigational concepts such as WWW pages and links. As a
result, WG-log schemata cleanly denote both logical and structural
concepts. In addition, it is possible to specify some hypertext features
like index pages or entry points, which are essentially related to the
hypertext presentation style. In this section we informally present the
data model and the language of WG-Log; more formal definitions and examples
of G-Log can be found in (Paredaens et al 1995, Paredaens et al 1997).
Further references can be found in (Garzotto et al 1995), where the HDM
hypermedia design language is presented.

In WG-Log, directed labeled graphs are used as the formalism to specify and
represent Web site schemata, instances, views (also called access
structures), and queries. The nodes of the graphs stand for objects and the
edges indicate relationships between objects. In WG-log schemata, instances
and queries we distinguish four kinds of nodes:

• slots (also called concrete nodes), depicted as ellipses, indicate objects with 
a representable value; instances of slots are strings, texts, pictures, sound 
tracks, numbers, movies or movie frames (depending on the desired gran- 
ularity of representation); 

• entities, depicted as rectangles, indicate abstract objects such as monu- 
ments, professors, or cities; note that an abstract object can be chosen to 
correspond to one or more Web pages, possibly linked to each other in dif- 
ferent ways: it is for the designer to decide which level of granularity the 
schema is meant to convey; 

• collections, represented by a rectangle containing a set of horizontal
lines, indicate collections or aggregates of objects, generally of the two
types above; an example of such a node is the index of all painters in a
certain gallery (in this case we say that the collection is homogeneous);

• entry points, depicted as triangles, represent the unique page that gives
access to a portion of the site (or to an alternative view of the site),
for instance the site home page. Only one node in the site instance
corresponds to each entry point type. It is worth noting that entry points
and collection nodes can be used in queries to create new access structures
that provide alternative presentations of portions of the Web site.



We also distinguish four kinds of graph edges:

• structural edges, representing navigational links between pages; such an
edge may stand for the link between a collection node representing the
painter index and the entity of type painter;

• logical edges, representing logical relationships between objects; such
an edge might connect painters to their paintings. The presence of a
logical relationship does not necessarily imply the presence of a
navigational link between the two entities at the instance level;

• double edges, representing a navigational link coupled with a logical
link; such an edge might connect a painter to his/her paintings, also
indicating that there is a navigational link that allows paintings to be
reached from their author;

• Is_a edges, representing the inheritance relationship between objects;
such an edge might connect painters and artists (as their generalization).
Note that Is_a edges are only allowed in a WG-log schema, since they do not
make any sense at the instance and query level.

As an example of the use of these lexical elements, Fig. 5 shows the WG-log
schema, while Fig. 6 contains an instance, namely the experimental WWW site
whose URL is http://romeo.sci.univr.it/vrtour. This WG-log description was
easily obtained as a part of the design process of the site; an HTML page
is shown in Fig. 7. It should also be noted that all entities in a schema
are marked by a unique code that will be exploited during query execution.
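
The lexicon above can be pictured as a small set of data types. The following Python fragment is only an illustrative sketch: the class names, the helper functions and the example labels (PainterIndex, Painter, Artist, Painting) are assumptions made here and are not part of WG-log itself.

```python
from dataclasses import dataclass
from enum import Enum


class NodeKind(Enum):
    SLOT = "slot"                # concrete node with a representable value
    ENTITY = "entity"            # abstract object, mapped to one or more pages
    COLLECTION = "collection"    # index or aggregate of objects
    ENTRY_POINT = "entry point"  # unique access page of a site portion


class EdgeKind(Enum):
    STRUCTURAL = "structural"  # navigational link only
    LOGICAL = "logical"        # logical relationship only
    DOUBLE = "double"          # navigational and logical link together
    IS_A = "is_a"              # inheritance, allowed in schemata only


@dataclass(frozen=True)
class Node:
    ident: str
    kind: NodeKind
    label: str


@dataclass(frozen=True)
class Edge:
    source: Node
    kind: EdgeKind
    target: Node
    label: str = ""


def is_navigable(edge: Edge) -> bool:
    """True when the edge implies a hyperlink that a user can follow."""
    return edge.kind in (EdgeKind.STRUCTURAL, EdgeKind.DOUBLE)


def allowed_in_instance(edge: Edge) -> bool:
    """Is_a edges make sense at the schema level only."""
    return edge.kind is not EdgeKind.IS_A


# Hypothetical fragment of a gallery schema.
index = Node("c1", NodeKind.COLLECTION, "PainterIndex")
painter = Node("e1", NodeKind.ENTITY, "Painter")
artist = Node("e2", NodeKind.ENTITY, "Artist")
painting = Node("e3", NodeKind.ENTITY, "Painting")

index_link = Edge(index, EdgeKind.STRUCTURAL, painter)
painted = Edge(painter, EdgeKind.LOGICAL, painting, "painted")
shown = Edge(painter, EdgeKind.DOUBLE, painting, "works")
generalizes = Edge(painter, EdgeKind.IS_A, artist)
```

Keeping the edge kind separate from the edge label mirrors the distinction drawn in the text between what a relationship means and whether it can be navigated.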



3.1 WG-log schemata 

A (site) schema contains information about the structure of the Web site.
This includes the (types of) objects that are allowed in the Web site, how
they can be related and what values they can take. Logical as well as
navigational (structural) elements can be included in a site schema, thus
allowing for flexibility in the choice of the level of detail. In fact,
index and entry point nodes, mainly related to the hypertext presentation
style, may be used by designers who want to emphasize the hypertextual
aspects of their design, while such elements can be dispensed with by those
designers who are more interested in the conceptual contents of the site
schema. As far as node granularity is concerned, an entity can be chosen to
represent one or more site pages. For instance, in a University site an
entity Professor in the schema might be mapped in the instance to one page
or to several pages for each professor; in the latter case, pages referring
to the same professor may be linked to each other in different ways, the
only constraint on different implementations being the availability at
query time of the information about instance references to their schema
entities. Formally, a schema contains the following elements:



• a set of Entry Point labels EP,

• a set SL of concrete object (or Slot) Labels,

• a set ENL of ENtity Labels, containing the special label dummy,

• a set COL of Collection Labels,

• a set LEL of Logical Edge Labels,

• one Structural Edge Label SEL (which in practice is omitted),

• a set DEL of Double Edge Labels,

• one Is_a edge label ISA,

• and a set $\mathcal{P}$ of productions*.

*We require that the productions form a set because we do not allow
duplicate productions in a schema.

Figure 4 WG-log lexicon

The productions dictate the structure of WG-Log instances (which are the
actual sites); the productions are triples representing the types of the
edges in the instance graph. The first component of a production always
belongs to $(ENL \setminus \{dummy\}) \cup COL \cup EP$, since only
non-concrete objects can be related to other objects. The second component
is an edge label and the third component is an object label of any type. If
the edge label is ISA, the two node labels must both belong to ENL. A Web
site schema can be represented as a directed labelled graph, by taking all
the objects as nodes and all the productions as edges. Note that two nodes
might be connected by more than one edge. If multiple logical edges connect
two nodes, they represent different relationships between those objects;
the presence of a structural edge between two nodes represents the
(possible) presence of a direct navigational link between the corresponding
pages in the site instance. No two nodes, however, can be connected by more
than one structural edge or by more than one ISA edge, since this would be
meaningless. Finally, we assume a function $\pi$ associating to each slot
label a set of constants, which is its domain; for instance, the domain of
a slot of type image might be the set of all jpeg files.

Figure 6 The site instance
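
The constraints on productions stated above can be checked mechanically. The sketch below is a minimal illustration under several assumptions: the `Schema` class, the `schema_errors` function and the example labels are invented here, the single structural label is written "SEL", and the dummy label is excluded from production targets on the basis of the note that no dummy node is allowed in schemata.

```python
from dataclasses import dataclass, replace


@dataclass
class Schema:
    entry_points: set   # EP
    slots: set          # SL
    entities: set       # ENL, may contain the special label "dummy"
    collections: set    # COL
    logical_edges: set  # LEL
    double_edges: set   # DEL
    productions: set    # triples (source label, edge label, target label)


def schema_errors(s: Schema) -> list:
    """List the productions that violate the constraints of the text."""
    sources = (s.entities - {"dummy"}) | s.collections | s.entry_points
    targets = sources | s.slots
    edge_labels = s.logical_edges | s.double_edges | {"SEL", "ISA"}
    errors = []
    for src, edge, dst in sorted(s.productions):
        if src not in sources:
            errors.append(f"{src}: only non-concrete objects may start an edge")
        if edge not in edge_labels:
            errors.append(f"{edge}: unknown edge label")
        if dst not in targets:
            errors.append(f"{dst}: unknown object label")
        if edge == "ISA" and not {src, dst} <= s.entities:
            errors.append(f"ISA must connect two entity labels: {src}, {dst}")
    return errors


# Hypothetical schema fragment.
schema = Schema(
    entry_points={"Home"},
    slots={"Name"},
    entities={"Artist", "Painter", "dummy"},
    collections={"PainterIndex"},
    logical_edges={"name"},
    double_edges={"painted"},
    productions={
        ("Home", "SEL", "PainterIndex"),
        ("PainterIndex", "SEL", "Painter"),
        ("Painter", "ISA", "Artist"),
        ("Painter", "name", "Name"),
    },
)
slot_as_source = replace(schema, productions={("Name", "SEL", "Painter")})
bad_isa = replace(schema, productions={("Painter", "ISA", "PainterIndex")})
```

Because the productions are stored as a set and there is a single SEL label and a single ISA label, a given ordered pair of labels cannot carry two structural or two ISA productions, which matches the uniqueness requirement without an extra check.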



3.2 WG-log instances 

A (Web site) instance over a schema S contains the actual information that
is stored in the Web site pages. It is a directed labeled graph
$I = (N, E)$. N is a set of labeled nodes. Each node represents an object
whose type is specified by its label. The label $\lambda(n)$ of a node n of
N belongs to $EP \cup SL \cup (ENL \setminus \{dummy\}) \cup COL$*. If
$\lambda(n)$ is in EP, then n is an entry point node, and it is the only
one with label $\lambda(n)$; if $\lambda(n)$ is in SL, then n is a concrete
node (or a slot); if $\lambda(n)$ is in ENL, then n is an abstract object,
which can coincide with one or more site pages or with a part of a page;
otherwise n is a collection node, which means that it contains an
aggregation of homogeneous or heterogeneous objects. If n is concrete, it
has an additional label print(n), called the print label, which must be a
constant in $\pi(\lambda(n))$. E is a set of directed labeled edges. An
edge e of E going from node n to n' is denoted $(n, \alpha, n')$. $\alpha$
is the label of e and belongs to $LEL \cup \{SEL\} \cup DEL$**. The edges
must also conform to the productions of the schema, so
$(\lambda(n), \alpha, \lambda(n'))$ must belong to $\mathcal{P}$. Besides
these edges, we also assume an implicit equality edge (a logical edge with
an equality sign as label) going from each node of the instance to itself.
Figure 6 contains an instance over the schema of Figure 5.

*Note that no dummy node is allowed in schemata and instances.
**Note that no ISA edge is allowed in instances.

Figure 7 A page of our trial site

Figure 8 A WG-Log solid rule.
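
As a rough illustration of what it means for an instance to be "over" a schema, the following Python sketch checks the three conditions given above: uniqueness of entry point nodes, print labels drawn from the slot domain, and edges typed by productions. All names (`conforms`, the labels, the file name) are hypothetical, domains are modelled as predicates, and the implicit equality edges are left out.

```python
def conforms(labels, prints, edges, productions, entry_points, domains):
    """Check an instance against a schema.

    labels:      dict node id -> label (the function lambda of the text)
    prints:      dict node id -> print label, for concrete nodes
    edges:       set of triples (source id, edge label, target id)
    productions: set of triples (source label, edge label, target label)
    domains:     dict slot label -> predicate playing the role of pi
    """
    # An entry point label may be carried by one node only.
    for ep in entry_points:
        if list(labels.values()).count(ep) > 1:
            return False
    # The print label of a concrete node must lie in its slot domain.
    for node, label in labels.items():
        if label in domains and not domains[label](prints.get(node)):
            return False
    # Every edge (n, a, n') needs a production (label(n), a, label(n')).
    return all((labels[s], a, labels[t]) in productions for s, a, t in edges)


def is_jpeg(value):
    return isinstance(value, str) and value.endswith(".jpg")


productions = {
    ("Home", "SEL", "Monument"),
    ("Monument", "created_by", "Author"),
    ("Monument", "picture", "Image"),
}
labels = {"home": "Home", "m1": "Monument", "a1": "Author", "pic": "Image"}
prints = {"pic": "arena.jpg"}
edges = {
    ("home", "SEL", "m1"),
    ("m1", "created_by", "a1"),
    ("m1", "picture", "pic"),
}
ok = conforms(labels, prints, edges, productions, {"Home"}, {"Image": is_jpeg})
# An edge with no matching production makes the instance invalid.
untyped = conforms(labels, prints, edges | {("a1", "SEL", "home")},
                   productions, {"Home"}, {"Image": is_jpeg})
```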



4 WG-LOG RULES AND PROGRAMS 

A WG-Log query is a (set of) graph(s) whose nodes can belong to all four
node types used in schemata, and whose edges can be logical, double or
structural. This allows for both pure database-like and pure navigational
queries; with respect to our experimental site, one query could select all
the monuments which are the work of a given author, while another could
list all the pages in the site linked to more than five distinct nodes.
This technique also opens the interesting possibility of mixed queries,
e.g. listing the works index page of all authors. In all three cases, the
query results in a transformation performed on the instance.

WG-Log rules, programs and goals can be used to deduce, query and
restructure the information contained in the Web site pages. Rules are
themselves graphs, which can be arranged in programs in such a way that new
views (or perspectives) of (parts of) the Web site become available to the
user. Like Horn clauses, rules in WG-Log represent implications. To
distinguish the body of the rule from the head in the graph P representing
the rule, the part of P that corresponds to the body is colored red, and
the part that corresponds to the head is green. Since this paper is in
black and white, we use thin lines for red nodes and edges and thick lines
for green ones.

Figure 8 contains a WG-Log rule over the Web site schema of Figure 5. It
expresses the query: Find all the monuments whose author is Bibiena. Note
the use of the Result collection node in the rule: it will build an access
structure in the resulting instance. The application of a rule r to a site
instance I produces a minimal superinstance of I that satisfies r. We say
that an instance satisfies a rule if every matching of the red part of the
rule in the instance can be extended to a matching of the whole rule in the
instance. The matchings of (parts of) rules in instances are called
embeddings. For example, the instance I of Figure 6 does not satisfy the
rule r of Figure 8. In fact, there is one possible embedding i of the red
part of r in I; hence, the monument-nodes of I pertaining to Bibiena must
be connected to a Result-node, and this is not the case. Because I does not
satisfy r, I is extended in a minimal way such that it satisfies r. In this
case, the effect of rule application is that a Result-node is created and
is linked to all the appropriate monument-nodes by an SEL-edge. Now the
instance satisfies the rule, and no smaller superinstance of I does, so
this is the result of the query specified by the rule. Note that the new
instance, obtained from the query, contains a new access structure (the
Result node and its links to all Bibiena’s monuments), which allows the
retrieval of the nodes in an alternative way that was not possible in the
initial instance. We will see in the sequel that WG-log also allows the
expression of goals, in order to filter out non-interesting information
from the instance obtained from the query.

Rules in WG-Log can also contain negation in the body; we use solid lines
to represent positive information and dashed lines to represent negative
information. So a WG-Log rule can contain three colors: red solid (RS), red
dashed (RD), and green solid (GS). The rule of Figure 9 expresses the
query: find all monuments of the Venetian period whose author is not
Bibiena. The instance I of Figure 6 does not satisfy this rule either. The
two possible embeddings of the RS part of r in I are applicable since they
cannot be extended to embeddings of the RS and the RD part of r in I.
Hence, the monument-nodes of I should be connected to a Result-node (to
extend the embeddings found to embeddings also of the GS part of r in I),
and this is not the case in I.

Figure 9 A WG-Log rule involving negation.
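
The effect of the Bibiena rule can be imitated by hand for a toy instance. The Python sketch below is hard-coded for that single rule and is not a general WG-Log evaluator; the node identifiers, the second author Palladio and the edge labels `created_by` and `name` are assumptions made for illustration, and the input is assumed to contain no Result node yet.

```python
def apply_bibiena_rule(labels, prints, edges):
    """Return a minimal superinstance in which every monument whose author
    is named Bibiena is linked from a Result node by an SEL edge."""
    # Embeddings of the red part: Monument -created_by-> Author -name-> "Bibiena".
    authors = {
        a for a, lab in labels.items()
        if lab == "Author"
        and any(s == a and e == "name" and prints.get(t) == "Bibiena"
                for s, e, t in edges)
    }
    monuments = {
        s for s, e, t in edges
        if e == "created_by" and t in authors and labels[s] == "Monument"
    }
    if not monuments:
        # No embedding of the red part: the instance already satisfies the rule.
        return dict(labels), set(edges)
    new_labels = dict(labels)
    new_labels["result"] = "Result"  # the new access structure
    new_edges = set(edges) | {("result", "SEL", m) for m in monuments}
    return new_labels, new_edges


labels = {"m1": "Monument", "m2": "Monument", "a1": "Author",
          "a2": "Author", "n1": "Name", "n2": "Name"}
prints = {"n1": "Bibiena", "n2": "Palladio"}
edges = {("m1", "created_by", "a1"), ("m2", "created_by", "a2"),
         ("a1", "name", "n1"), ("a2", "name", "n2")}

new_labels, new_edges = apply_bibiena_rule(labels, prints, edges)
```

Only one node and one edge per matching monument are added, which is what makes the resulting superinstance minimal in this toy setting.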



4.1 WG-Log Rules and Goals 

We now formally define what WG-Log rules are and when an instance satisfies
such a rule, or a set of such rules. As in G-log, WG-Log rules are
constructed from patterns. A pattern over a Web site schema is similar to
an instance over that schema. There are three differences: 1) in a pattern
equality edges may occur between different nodes having the same label,
2) in a pattern concrete nodes may have no print label, and 3) a pattern
may contain entity nodes with the dummy label, used to refer to a generic
instance node. A pattern denotes a graph that has to be embedded in an
instance, i.e. matched to a part of that instance. An equality edge between
two different nodes indicates that they must be mapped to the same node of
an instance.

A colored pattern over a schema is a pattern of which every node and edge
is assigned one of the colors RS, RD, or GS. If P is a colored pattern, we
indicate by $P_{RS}$ the red solid part of P, by $P_{RS,RD}$ the whole red
part of P, and by $P_{RS,GS}$ the solid part of P. In a generic colored
pattern, these parts need not be patterns themselves, since they can
contain dangling edges. However, for P to be a WG-Log rule, we require that
these subparts of P be patterns. Formally, a WG-Log rule r consists of two
schemata $S_1$ and $S_2$, and a graph P. $S_1$ is called the source
(schema) of r. $S_2$ is a superschema* of $S_1$, and is referred to as the
target (schema) of r. P must be a colored pattern over $S_2$ such that
$P_{RS}$, $P_{RS,RD}$ and $P_{RS,GS}$ are patterns over $S_2$. Figure 8
contains the colored pattern P of a rule that has as source the schema of
Figure 5, and as target the same schema, to which a Result-node and an
in-edge are added.

To define when an instance satisfies a rule, we need the notion of
embedding. An embedding i of a pattern $P = (N_P, E_P)$ in an instance
$I = (N_I, E_I)$ is a total mapping $i : N_P \to N_I$ such that for every
node n in $N_P$

• either $\lambda(i(n)) = \lambda(n)$, or

• there is a production $(\lambda(n), ISA, \lambda(i(n)))$ in the target
schema;

moreover, if n has a print label, then $print(i(n)) = print(n)$. Also, if
$(n, \alpha, n')$ is an edge in $E_P$, then $(i(n), \alpha, i(n'))$ must be
an edge in $E_I$.

Let $P = (N, E)$ be a subpattern of the pattern P' and let I be an
instance. An embedding j of P' in I is an extension of an embedding i of P
in I if $i = j|_N$. An embedding i of P in I is constrained by P' if P'
equals P or if there is no possible extension of i to an embedding of P' in
I. We use the notion of “constrained” to express negation: an embedding is
constrained by a pattern if it cannot be extended to an embedding of that
pattern.

Let r be a WG-Log rule with colored pattern P and target $S_2$. An instance
I over $S_2$ satisfies r if every embedding of $P_{RS}$ in I that is
constrained by $P_{RS,RD}$ can be extended to an embedding of $P_{RS,GS}$
in I. As we informally mentioned before, the instance of Figure 6 does not
satisfy the rule of Figure 9. The only embedding i of $P_{RS}$ in I is
constrained by $P_{RS,RD}$ (because it cannot be extended to an embedding
of $P_{RS,RD}$ in I), and cannot be extended to an embedding of $P_{RS,GS}$
in I.

To express complex queries in WG-Log, we can combine several rules that
have the same source $S_1$ and target $S_2$ in one WG-Log set. So, a WG-Log
set A is a finite set of WG-Log rules that work on the same schemata. $S_1$
is called the source (schema) of A and $S_2$ is its target (schema). The
generalization of satisfaction to the case of WG-Log rule sets is
straightforward. Let A be a WG-Log set with target S. An instance I over S
satisfies A if I satisfies every rule of A.

In WG-Log it is also possible to use goals. A goal over a schema S is a
subschema of S, and is used to select information of the Web site.
Normally, a goal is combined with a query to remove uninteresting
information from the resulting instance. The effect of applying a goal G
over a schema S to an instance I over S is called I restricted to G
(notation: $I|_G$) and is the maximal subinstance of I that is an instance
over G. The definition of satisfaction of a WG-Log set is easily extended
to sets with goals. If A is a WG-Log set with target $S_2$, then an
instance I over G satisfies A with goal G if there exists an instance I'
over $S_2$ such that I' satisfies A and $I'|_G = I$.

*Sub- and superschema, sub- and superinstance, and sub- and superpattern
are defined with respect to set inclusion.
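
The definitions of embedding, extension, "constrained" and satisfaction can be tried out on small graphs with a brute-force search. The Python sketch below is an assumption-laden illustration rather than the evaluation method of WG-Log: it ignores ISA productions, print labels and equality edges, represents a pattern as a pair (node labels, edges), and uses a made-up rule stating that every monument without a created_by edge to an author must be linked from a Result node.

```python
from itertools import product


def embeddings(pattern, i_nodes, i_edges):
    """Yield every total, label- and edge-preserving mapping of the
    pattern nodes into the instance nodes."""
    p_nodes, p_edges = pattern
    names = sorted(p_nodes)
    candidates = [[n for n, lab in i_nodes.items() if lab == p_nodes[v]]
                  for v in names]
    for image in product(*candidates):
        mapping = dict(zip(names, image))
        if all((mapping[s], a, mapping[t]) in i_edges for s, a, t in p_edges):
            yield mapping


def extends(j, i):
    """True if embedding j agrees with embedding i on the nodes of i."""
    return all(j[k] == v for k, v in i.items())


def satisfies(i_nodes, i_edges, rs, rs_rd, rs_gs):
    """Every embedding of RS constrained by RS+RD must extend to RS+GS."""
    for i in embeddings(rs, i_nodes, i_edges):
        constrained = rs_rd == rs or not any(
            extends(j, i) for j in embeddings(rs_rd, i_nodes, i_edges))
        if constrained and not any(
                extends(j, i) for j in embeddings(rs_gs, i_nodes, i_edges)):
            return False
    return True


# Hypothetical rule with negation.
rs = ({"m": "Monument"}, set())
rs_rd = ({"m": "Monument", "a": "Author"}, {("m", "created_by", "a")})
rs_gs = ({"m": "Monument", "r": "Result"}, {("r", "SEL", "m")})

nodes = {"m1": "Monument", "m2": "Monument", "a1": "Author"}
edges = {("m1", "created_by", "a1")}
before = satisfies(nodes, edges, rs, rs_rd, rs_gs)

# Extending the instance minimally: one Result node linked to m2.
nodes2 = {**nodes, "res": "Result"}
edges2 = edges | {("res", "SEL", "m2")}
after = satisfies(nodes2, edges2, rs, rs_rd, rs_gs)
```

The embedding of m onto m1 extends to the red dashed part and therefore imposes no obligation, whereas the embedding onto m2 is constrained and requires the green part.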



4.2 WG-Log Programs and Semantics 

There is a strong connection between G-Log and first order predicate
calculus. In (Paredaens et al 1995) it is shown that for every formula on a
binary many-sorted first order language there is an effective procedure
that transforms it into an “equivalent” set of G-Log rules and a goal; the
converse is trivially true. Hence, G-Log can be seen as a graphical
counterpart of logic. WG-log is only a syntactic variant of G-log, whose
semantics we want to retain in order to keep its expressive power and
representation capability; thus the same correspondence holds for WG-log.
Consider for instance the rule of Figure 9. This may be expressed in First
Order Logic as follows:

$$\forall m\, \forall p\, \forall a\, \exists result : \bigl( created\_in(m, p) \wedge period(p, \text{“Venetian”}) \wedge name(a, \text{“Bibiena”}) \wedge \neg created\_by(m, a) \bigr) \Rightarrow SEL(result, m)$$

Note that simpler languages like Datalog do not capture the whole
expressive power of G-log: a Datalog rule is expressed in G-log by a simple
rule containing red solid nodes and edges, and only one green edge. Thus,
it is not possible to express the semantics of WG-log by translating it
into Datalog.

In the previous section we defined when an instance satisfies a WG-Log rule
set; by examining the logical counterpart of WG-log, we get an intuition of
the meaning of a WG-log rule; however, in order to use WG-Log as a query
language we need to define its effect, i.e. the way it acts on instances to
produce other instances; only in this way will we be able to isolate, among
the infinitely many instances that satisfy a certain rule, the one we
choose as the rule’s result. The semantics of a WG-Log set A with source
$S_1$ and target $S_2$ is thus a binary relation over instances defined by:

$$
Sem(A) = \{\, (I, J) \mid
\begin{array}[t]{l}
\text{1. } I \text{ is an instance over } S_1 \text{ and } J \text{ is an instance over } S_2, \\
\text{2. } J \text{ satisfies } A, \\
\text{3. } J|_{S_1} = I, \\
\text{4. no proper subinstance of } J \text{ satisfies conditions 1 to 3} \,\}
\end{array}
$$
Item 3 expresses the requirement that in WG-Log we only allow queries, and
no updates. If a WG-Log rule contains a red dashed and a green solid part,
then it can be satisfied either by adding the red dashed part to an
instance or by adding the green solid part. Because of item 3, the source
schema can be chosen in such a way that only one (or even none) of the two
extensions is allowed. In this way the semantics of the rule also depends
on its source schema. Item 4 expresses minimality. In general there will be
more than one minimal result of applying a WG-Log set to an instance, which
corresponds to the fact that WG-Log is non-deterministic and Sem is a
relation and not a function.

WG-Log also allows sets of rules to be sequenced. A WG-Log program P is a
finite list of WG-Log sets such that the target schema of each set of P
equals the source schema of the next set in the program. The source schema
of the first set is the source (schema) of P, and the target schema of the
last set is the target (schema) of P. The semantics Sem(P) of a WG-Log
program $P = \langle A_1, \ldots, A_n \rangle$ is the set of pairs of
instances $(I_1, I_{n+1})$ such that there is a chain of instances
$I_2, \ldots, I_n$ for which $(I_j, I_{j+1})$ belongs to $Sem(A_j)$, for
all j. If a number of WG-Log rules are put in sequence instead of in one
set, then, because minimization is applied after each rule, fewer minimal
models are allowed. In fact, sequencing can be used to make a
non-deterministic set of rules deterministic. Finally, a goal can be used
in conjunction with a program. If $S_2$ is the target of P and G is a goal
over $S_2$, then the semantics of P with goal G is:
$Sem(P, G) = \{ (I, J) \mid \exists (I, J') \in Sem(P) \text{ such that } J'|_G = J \}$.
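
Since Sem is a relation, the semantics of a program behaves like relational composition followed, optionally, by a restriction to the goal. The Python sketch below illustrates only this chaining: each WG-Log set is replaced by a hypothetical function returning all its possible result instances, an instance is reduced to a frozenset of edge triples, and the goal is simplified to a set of edge labels instead of a subschema.

```python
def run_program(steps, start):
    """Chain the steps: the outputs of one step feed the next one.

    steps: list of functions, each mapping an instance to an iterable of
           possible result instances (more than one if non-deterministic).
    """
    current = {start}
    for step in steps:
        current = {out for inst in current for out in step(inst)}
    return current


def restrict(instance, goal_labels):
    """Simplified stand-in for J|G: keep edges whose label is in the goal."""
    return frozenset(edge for edge in instance if edge[1] in goal_labels)


def run_with_goal(steps, start, goal_labels):
    return {restrict(j, goal_labels) for j in run_program(steps, start)}


# Two toy steps; the second one is non-deterministic.
def add_index_link(inst):
    return [inst | {("home", "SEL", "index")}]


def describe_one_page(inst):
    return [inst | {("index", "about", "p1")},
            inst | {("index", "about", "p2")}]


steps = [add_index_link, describe_one_page]
results = run_program(steps, frozenset())
navigational_view = run_with_goal(steps, frozenset(), {"SEL"})
```

In this toy run the program has two possible results, which the goal collapses into a single restricted instance; a real goal restricts by subschema, so this shows only the shape of the definition.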
There are three complexity levels of constructions to express queries in WG-Log:
rules, sets and programs, all three of which can be used in conjunction with a
goal. This results in the six cases stated in the table of Figure 10. The use
of all three complexity levels guarantees that WG-Log is computationally
complete (Paredaens et al. 1995), i.e., it can produce any desired superinstance
of a given instance. Normally, one or two rules, together with a goal, are
sufficient to express most of the interesting queries we can pose to a Web site;
however, some important queries do require the full language complexity. As
an example, suppose we want to find all the pairs of nodes that are unreachable
from each other by navigation; in other words, we want all the pairs that are
not in the transitive closure of the relationship expressed by label SEL. An
easy and natural way to solve this query is to compute the transitive closure
stc of SEL, and then take the complement ctc of that relation. The WG-Log
program of Figure 11 solves this problem. It is a sequence of two sets of rules.
The first set, which consists of two rules, adds stc-edges (logical) between all
nodes that are linked by a SEL-path. The second set has only one rule and
takes the complement of the transitive closure by adding a ctc-edge if there
is no stc-edge. Finally, a goal can be added to select only the node pairs
that are linked by the ctc (logical) relationship.

Figure 10 The complexity levels of WG-Log queries.

Figure 11 The complement of the navigational transitive closure.

Figure 12 A Goal on the complement of the navigational transitive closure.

Another interesting query might ask for all the nodes that are not reachable
from a specific one, for instance the page of the artist Bibiena; in this case,
the program must be complemented by the goal of Figure 12. Note that such
goals can typically be used to optimize computation; however, this is outside
the scope of this paper.
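
As an illustration only, the stc/ctc computation described above can be sketched in ordinary code. The following Python fragment is not WG-Log and does not reproduce the rules of Figure 11; it assumes that SEL links are given as ordered pairs of page names, that the complement ranges over pairs of distinct nodes, and it uses an invented four-page site.

```python
def transitive_closure(edges):
    """Pairs (a, b) such that b is reachable from a by a non-empty path."""
    closure = set(edges)
    changed = True
    while changed:                      # naive fixpoint iteration
        changed = False
        for a, b in list(closure):
            for c, d in edges:
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def complement(nodes, relation):
    """Ordered pairs of distinct nodes that are not in the relation."""
    return {(a, b) for a in nodes for b in nodes
            if a != b and (a, b) not in relation}


def unreachable_from(node, ctc):
    """Goal-like selection: keep only the pairs starting at one node."""
    return {b for a, b in ctc if a == node}


# Hypothetical site: page names and SEL links are invented for illustration.
nodes = {"home", "artists", "bibiena", "monuments"}
sel = {("home", "artists"), ("artists", "bibiena"), ("home", "monuments")}

stc = transitive_closure(sel)           # first set of rules: stc-edges
ctc = complement(nodes, stc)            # second set: ctc-edge where no stc-edge
print(sorted(unreachable_from("bibiena", ctc)))
```

The two function calls mirror the sequencing of the two rule sets: the complement is only meaningful once the closure has reached its fixpoint.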



5 EVALUATION OF WG-LOG PROGRAMS 

In order to be able to express a rich set of queries, we have conceived WG-Log
as a language with a complex semantics; this gives rise to a computation
algorithm that, in the general case, is very inefficient. However, in most cases
the queries are expressed by only one or two rules, and possibly a goal which
contributes to improving the efficiency of program computation. In the first
subsection we present the computation algorithm in its most general form;
later, we present an example of query computation, based on very simple
data structures, which gives the flavour of the real complexity the system
will have to tolerate without any improvements. In future work we will study
appropriate data structures, and optimizations based on the goal structure,
which will offer the possibility of increasing the efficiency of the naive approach
presented here.



5.1 A general Computation Algorithm 

We now present the FastComp algorithm, which computes the result of a generic
WG-Log set by using a kind of backtracking fixpoint technique. Suppose we
are given a set of rules $A = \{r_1, \ldots, r_k\}$ with source $S_1$ and target $S_2$, and
a finite instance I over $S_1$. The procedure FastComp will try to extend I to
an instance J, in such a way that J is finite and $(I, J) \in Sem(A)$. If this
is impossible, it will print the message: “No solution”. FastComp calls the
function Extend, which recursively adds elements to J until J satisfies A, or
until J cannot be extended anymore to satisfy A. In this last case, the function
backtracks to points where it made a choice among a number of minimal
extensions and continues with the next possible minimal choice. If the function
backtracks to its first call, then there is no solution. In this sense, FastComp
is reminiscent of the “backtracking fixpoint” procedure that computes stable
models (Sacca et al. 1990).

Procedure FastComp(I, A, S_1, S_2)
    J = I;
    if (Extend(J, A, S_1, S_2))
    {
        minimize(J);
        output(J);
    }
    else
        output(“No solution”);

Function Extend(var J, A, S_1, S_2)
    for (l = 1, ..., k)    (* the rules are r_1, ..., r_k *)
        for (every embedding i of P_{l,RS} in J)
            if (J does not satisfy r_l due to i)
            {
                SetExt = ∅;
                if (P_{l,RD} ≠ ∅)
                    for (every legal, minimal RD extension Ext of J)
                        SetExt = SetExt ∪ { Ext };
                for (every legal, minimal GS extension Ext of J)
                    SetExt = SetExt ∪ { Ext };
                while (SetExt ≠ ∅)
                {
                    select Ext from SetExt;
                    add Ext to J;
                    if (Extend(J, A, S_1, S_2))
                        return (True);
                    else
                        remove Ext from J;
                    SetExt = SetExt \ { Ext };
                }
                return (False);
            }
    return (True);

The algorithm uses the notion of “legal, minimal extension” of an instance.
By legal, we mean that the extension may only contain nodes and edges not
belonging to $S_1$. Minimal indicates that no subpart of the extension is already
sufficient to make the embedding under consideration extendible. We denote
by FastComp(A) the set of all the pairs of instances (I, J), such that J is
an output of the FastComp algorithm, for inputs I and A. In (Paredaens et
al. 1997) we proved that the FastComp algorithm is sound and finitely complete:
FastComp(A) = FSem(A), for every WG-Log set A. Note that the complexity
of FastComp is accounted for by the high expressive power of the language.
The algorithm reduces to the standard fixpoint computation for those WG-Log
programs that are the graphical counterpart of Datalog, i.e. sets of rules
that consist of a red solid part and one green solid edge. Thus, efficiency of
computation can easily be achieved for such programs, while optimization
becomes more and more needed (and difficult) if more expressive queries are
posed.
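
The control structure of Extend (repair the first violated requirement, try each candidate extension in turn, undo it on failure) can be illustrated with a small generic sketch. The Python code below is only an assumption-laden analogue: the callbacks `find_violation` and `candidate_extensions` are hypothetical stand-ins for rule embeddings and legal, minimal extensions, the final minimization step is omitted, and the toy requirement (colouring a triangle graph) has nothing to do with WG-Log itself.

```python
def extend(J, find_violation, candidate_extensions):
    """Backtracking search in the spirit of Extend; J is modified in place."""
    violation = find_violation(J)
    if violation is None:
        return True                       # J satisfies every requirement
    for ext in candidate_extensions(J, violation):
        J |= ext                          # add Ext to J
        if extend(J, find_violation, candidate_extensions):
            return True
        J -= ext                          # backtrack: remove Ext from J
    return False                          # no extension repairs the violation


def fast_comp(I, find_violation, candidate_extensions):
    J = set(I)
    if extend(J, find_violation, candidate_extensions):
        return J
    return None                           # corresponds to "No solution"


# Toy problem (hypothetical): give every node a colour unlike its neighbours.
NODES = ["a", "b", "c"]
EDGES = {("a", "b"), ("b", "c"), ("a", "c")}


def make_problem(colours):
    def find_violation(J):
        coloured = {n for n, _ in J}
        for n in NODES:
            if n not in coloured:
                return n                  # first node still lacking a colour
        return None

    def candidate_extensions(J, node):
        neighbours = {y if x == node else x
                      for x, y in EDGES if node in (x, y)}
        used = {c for n, c in J if n in neighbours}
        return [{(node, c)} for c in colours if c not in used]

    return find_violation, candidate_extensions


print(fast_comp(set(), *make_problem(["red", "green", "blue"])))
print(fast_comp(set(), *make_problem(["red", "green"])))
```

With three colours the search succeeds without backtracking; with two colours every branch fails and the function backtracks to its first call, which is the "No solution" case.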



5.2 An example of Rule Evaluation 

We shall now briefly comment on how FastComp can be used, at least in
principle, to execute a WG-Log query. Our sample query execution is based
on three data structures:

• the (Typed) Adjacency Matrix TAM of the instance graph;

• the Instance Table IT linking schema entities and their instances;

• the URL list UL linking instances to HTML pages or other network objects.

The role of the instance table is in many respects similar to that of the ontology
introduced in (Luke et al. 1997). Each entry of the adjacency matrix lists the
typed links (navigational, logical or coupled) connecting a pair of nodes. For
the sake of conciseness, Fig. 13 does not show such a matrix, but a simpler,
binary matrix where each entry i, j is 1 whenever instances i and j are linked
by a navigational step, a logical relationship or both. Moreover, rows and
columns pertaining to slots are not listed. Actually, we do not need to store
slots in the TAM matrix; it is sufficient and surely less space-consuming to
store them in auxiliary data structures pertaining to each single entity, in
order to allow fast label matching. IT (Fig. 14) associates the unique code of
each schema entity to a list of unique numbers called instance identifiers; this
allows the Query Manager to trace instances of schema-defined entities in the
instance graph. Finally, the URL list associates each instance identifier to one
or more HTTP URLs. This means that in our approach an instance of an entity
is not necessarily a single page, though this will probably be the most frequent
case.

With respect to the sample query in Fig. 9, asking for the monuments of
the Venetian period whose author is not Francesco Bibiena, we remark that
the initial values of FastComp parameters are as follows: the whole instance
of Fig. 6, the single rule of Fig. 9 and the source schema of Fig. 5. The target
schema for this query can be easily deduced from the rule and will therefore
be omitted.

Figure 13

Entity    Instance
A         0
B         1,2,3,4
C         5,6,7
D         E,F,G
E         8,10,13,16,17,18
F         9,11,12
G         14,15

Figure 14 A sample instance table

To start with, IT is consulted to obtain the identifiers of the instance
entities that match the rule entities. The following lists are obtained: Period
= {1,2,3,4}, Monument = {8,10,13,16,17,18}. These lists are then used to
extract from the TAM the following possible adjacency information:

1 17 14 

2 16 - 

3 18 - 

4 8 10 

An equality test on the labels leaves only two possible embeddings for the 
red solid part of the rule: 4, 8 and 4, 10. Now, we are ready to follow the tree 
of recursive calls of Extend for our sample FastComp execution. Luckily, the 
recursion depth turns out to be only four in this case. 

1st call of Extend
    1st for iteration; embedding 4,8 does not satisfy rule
    Red dashed valid extensions: 5 (from TAM: the only instance of Venetian
    monument not related to Bibiena)
    Green solid valid extensions: Result
    SetExt = {5, Result}
    1st while iteration
    J = J ∪ {5}

    2nd call of Extend
        1st for iteration; embedding 4,8 does not satisfy rule
        Red dashed valid extensions: none
        Green solid valid extensions: Result
        SetExt = {Result}
        1st while iteration
        J = J ∪ {Result}

        3rd call of Extend
            1st for iteration; embedding 4,8 does satisfy rule
            2nd for iteration; embedding 4,10 does not satisfy rule
            Red dashed valid extensions: none
            Green solid valid extensions: Result
            SetExt = {Result}
            1st while iteration
            J = J ∪ {Result}

            4th call of Extend
                1st for iteration; embedding 4,8 satisfies rule
                2nd for iteration; embedding 4,10 satisfies rule
                Returns True (ends 4th call)
            Returns True (ends 3rd call)
        Returns True (ends 2nd call)
    Returns True (ends 1st call)

The instance thus obtained is minimal and is therefore a solution to the query.



6 CONCLUDING REMARKS AND FUTURE WORK 

Experience with current WWW search engines has shown that the availability
of a database-like schema is a prerequisite for any effective Web query
mechanism. Though we are fully aware that the system described in this paper
is only a preliminary step towards a satisfactory solution of the Web
structuring and querying problem, we believe that its conceptual basis is sound
and that its development may offer several interesting subjects for future
research. For instance, a most important and promising issue is query execution
itself, which must be made both more efficient and specialized to take into
account the goal structure, schema information possibly available from a
relational database underlying the site, and semantic properties of G-Log, which
enable the schema Robot to refuse a priori trivial or unsatisfiable queries.
This is most needed since, as we have seen, WG-Log retains the expressive
power of the original G-Log language: a carefully tuned execution mechanism
is thus required to keep complexity in check (and to avoid “result explosion”)
when dealing with those queries that involve some kind of transitive closure.

Another critical topic is the presentation of results: here not only efficiency
considerations are involved, but also problems concerning the heterogeneous
quality of the information stored. Where text, images, sound tracks and
similar pieces of information must be arranged to be shown to the user in a
coherent and understandable way, an architect’s skills are needed, besides those
of a software designer.

We plan to deal in the near future with querying federated Web sites. Namely,
we plan to allow Web users to formulate queries on the basis of several site
schemata at once, extending our query execution mechanism to take into
account links between distinct Web sites. Finally, we plan to address at a later
time more difficult problems like (semiautomatic) schema deduction on the
basis of instance inspection; schema integration over unrelated sites; schema
update at instance evolution; and effective treatment of instance and schema
graphs when these assume huge proportions.



Abstract

In the general framework of knowledge discovery, Data Mining techniques are 
usually dedicated to information extraction from structured databases. Text 
Mining techniques, on the other hand, are dedicated to information extrac- 
tion from unstructured textual data and Natural Language Processing (NLP) 
can then be seen as an interesting tool for the enhancement of information 
extraction procedures. In this paper, we present two examples of Text Min- 
ing tasks, association extraction and prototypical document extraction, along 
with several related NLP techniques. 

Keywords 

Text Mining, Knowledge Discovery, Natural Language Processing 



1 INTRODUCTION 

The ever-increasing importance of the problem of analyzing the large amounts
of data collected by companies and organizations has led to important
developments in the fields of automated Knowledge Discovery in Databases (KDD)
and Data Mining (DM). Typically, only a small fraction (5-10%) of the
collected data is ever analyzed. Furthermore, as the volume of available data
grows, decision-making directly from the content of the databases is no longer
feasible.

Standard KDD and DM techniques are concerned with the processing of
structured databases. Text Mining techniques are dedicated to the automated
extraction of information from unstructured textual data.

In Section 2, we present the differences between the traditional Data Mining 
and the more specific Text Mining approaches, and in the subsequent sections, 
we describe two examples of Text Mining applications, along with the related 
NLP techniques. 



Data Mining and Reverse Engineering S. Spaccapietra & F. Maryanski (Eds.)
© 1998 IFIP. Published by Chapman & Hall



2 TEXT MINING VS DATA MINING

According to Fayyad, Piatetsky-Shapiro and Smyth (1996), Knowledge
Discovery in Databases is ‘the non-trivial process of identifying valid, novel,
potentially useful and ultimately understandable patterns in data’, and therefore
refers to the overall process of discovering information from data. However,
as the usual techniques (inductive or statistical methods for building decision
trees, rule bases, nonlinear regression for classification, ...) explicitly rely on
the structuring of the data into predefined fields, Data Mining is essentially
concerned with information extraction from structured databases.

Table 1 shows an example of Inductive Logic Programming based learning
from an attribute-value database (Dzerovski 1996). The presented tables
contain the database and the rules induced by the mining process.



Potential Customer Table

Person        Age   Sex   Income      Customer
Ann Smith     32    F     10 000      yes
Joan Gray     53    F     1 000 000   yes
Mary Blythe   27    F     20 000      no
Jane Brown    55    F     20 000      yes
Bob Smith     50    M     100 000     yes
Jack Brown    50    M     200 000     yes

Married-To Table

Husband       Wife
Bob Smith     Ann Smith
Jack Brown    Jane Brown

Induced Rules

if Income(Person) ≥ 100 000 then Potential-Customer(Person)
if Sex(Person) = F and Age(Person) ≥ 32 then Potential-Customer(Person)
if Married(Person, Spouse) and Income(Person) ≥ 100 000 then Potential-Customer(Spouse)
if Married(Person, Spouse) and Potential-Customer(Person) then Potential-Customer(Spouse)

Table 1 An example of Data Mining using ILP techniques

This example illustrates how strongly the rule generation process relies on
the explicit structure of the relational database (presence of well-defined fields,
explicit identification of attribute-value pairs).
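
To make the dependence on explicit fields concrete, the induced rules of Table 1 can be applied mechanically to the two tables. The Python sketch below is only an illustration under two assumptions that are ours: the thresholds are read as inclusive (at least 100 000, at least 32), and Married(Person, Spouse) is treated as symmetric.

```python
# Data transcribed from Table 1; field names are chosen for this sketch.
people = {
    "Ann Smith":   {"age": 32, "sex": "F", "income": 10_000,    "customer": True},
    "Joan Gray":   {"age": 53, "sex": "F", "income": 1_000_000, "customer": True},
    "Mary Blythe": {"age": 27, "sex": "F", "income": 20_000,    "customer": False},
    "Jane Brown":  {"age": 55, "sex": "F", "income": 20_000,    "customer": True},
    "Bob Smith":   {"age": 50, "sex": "M", "income": 100_000,   "customer": True},
    "Jack Brown":  {"age": 50, "sex": "M", "income": 200_000,   "customer": True},
}
married = [("Bob Smith", "Ann Smith"), ("Jack Brown", "Jane Brown")]


def potential_customers(people, married):
    """Apply the four induced rules until no new customer is derived."""
    # Assumption: the Married relation holds in both directions.
    pairs = married + [(wife, husband) for husband, wife in married]
    derived = set()
    for name, p in people.items():
        if p["income"] >= 100_000:                   # rule 1
            derived.add(name)
        if p["sex"] == "F" and p["age"] >= 32:       # rule 2
            derived.add(name)
    for person, spouse in pairs:
        if people[person]["income"] >= 100_000:      # rule 3
            derived.add(spouse)
    changed = True
    while changed:                                   # rule 4 is recursive
        changed = False
        for person, spouse in pairs:
            if person in derived and spouse not in derived:
                derived.add(spouse)
                changed = True
    return derived


derived = potential_customers(people, married)
print(sorted(derived))
```

Every rule reads a named field or a named relation, which is exactly the structure that free text does not offer.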

In reality, however, a large portion of the available information appears
in textual and hence unstructured form (or more precisely in an implicitly
structured form). Specialized techniques specifically operating on textual data
then become necessary to extract information from such collections of
texts. These techniques are gathered under the name of Text Mining and, in
order to discover and use the implicit structure (e.g. grammatical structure)
of the texts, they may integrate some specific Natural Language Processing
(used for example to preprocess the textual data).

Text Mining applications impose strong constraints on the usual NLP tools.
For instance, as they involve large volumes of textual data, they do not
allow the integration of complex treatments (which would lead to exponential
and hence intractable algorithms). Furthermore, semantic models for the
application domains are rarely available, and this implies strong limitations on
the sophistication of the semantic and pragmatic levels of the linguistic models.

In fact, a working hypothesis (Feldman and Hirsh 1997) built upon the
experience gained in the domain of Information Retrieval assumes that shallow
representations of textual information often provide sufficient support for a
range of information access tasks.



3 ASSOCIATION EXTRACTION FROM INDEXED DATA 

If the textual data is indexed, either manually or automatically with the help 
of NLP techniques (such as the ones described in section 3.3), the indexing 
structures can be used as a basis for the actual knowledge discovery process. 

In this section, we present a way of finding information in a collection of
indexed documents by automatically retrieving relevant associations between
key-words.



3.1 Associations : definition 

Let us consider a set of key-words $A = \{w_1, w_2, \ldots, w_m\}$ and a collection of
indexed documents $T = \{t_1, t_2, \ldots, t_n\}$ (i.e. each $t_i$ is associated with a subset
of $A$ denoted $t_i(A)$).

Let $W \subseteq A$ be a set of key-words; the set of all documents $t$ in $T$ such that
$W \subseteq t(A)$ will be called the covering set for $W$ and denoted $[W]$.

Any pair $(W, w)$, where $W \subseteq A$ is a set of key-words and $w \in A \setminus W$, will
be called an association rule, and denoted $R : (W \Rightarrow w)$.

Given an association rule $R : (W \Rightarrow w)$,

• $S(R,T) = |[W \cup \{w\}]|$ is called the support of $R$ with respect to the
collection $T$ ($|X|$ denotes the size of $X$);

• $C(R,T) = \dfrac{|[W \cup \{w\}]|}{|[W]|}$ is called the confidence of $R$ with respect to the
collection $T$.

Notice that $C(R,T)$ is an approximation (maximum likelihood estimate)
of the conditional probability for a text of being indexed by the key-word
$w$ if it is already indexed by the key-word set $W$.

An association rule $R$ generated from a collection of texts $T$ is said to satisfy
support and confidence constraints $\sigma$ and $\gamma$ if

$S(R,T) \geq \sigma$ and $C(R,T) \geq \gamma$.

To simplify notations, $[W \cup \{w\}]$ will often be written $[Ww]$ and a rule
$R : (W \Rightarrow w)$ satisfying given support and confidence constraints will be
simply written as:

$W \Rightarrow w \quad S(R,T)/C(R,T)$
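
These definitions translate directly into a few lines of code. The sketch below is a minimal Python illustration, assuming that the indexed collection is represented as a dictionary from document identifiers to key-word sets; the four documents and their key-words are invented and are not taken from the Reuter corpus.

```python
def covering_set(keywords, docs):
    """[W]: identifiers of the documents indexed by every key-word in W."""
    W = set(keywords)
    return {doc_id for doc_id, index in docs.items() if W <= index}


def support(W, w, docs):
    """S(R, T) for the rule R: W => w, i.e. the size of [W u {w}]."""
    return len(covering_set(set(W) | {w}, docs))


def confidence(W, w, docs):
    """C(R, T): support of the rule divided by the size of [W]."""
    covered = len(covering_set(W, docs))
    return support(W, w, docs) / covered if covered else 0.0


# Hypothetical indexed collection (document id -> key-word set).
docs = {
    "t1": {"gold", "silver", "usa"},
    "t2": {"gold", "silver", "usa"},
    "t3": {"gold", "silver", "canada"},
    "t4": {"gold", "copper", "canada"},
}

s = support({"gold", "silver"}, "usa", docs)
c = confidence({"gold", "silver"}, "usa", docs)
print(f"[gold, silver] => usa  {s}/{c:.3f}")
```

The printed line follows the support/confidence notation introduced above.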



3.2 Mining for associations 

Association extraction experiments have been carried out by Feldman et al.
(1996) with the KDT (Knowledge Discovery in Texts) system on the Reuter
corpus. The Reuter corpus is a set of 22173 documents that appeared on
the Reuter newswire in 1987. The documents were assembled and manually
indexed by Reuters Ltd. and Carnegie Group Inc. in 1987. Further formatting
and data file production was done in 1991 and 1992 by David D. Lewis and
Peter Shoemaker.

The documents were indexed with 135 categories in the Economics domain.
The mining was performed on the indexed documents only (i.e. exclusively on
the key-word sets representing the real documents).

All known algorithms for generating association rules operate in two phases.
Given a set of key-words $A = \{w_1, w_2, \ldots, w_m\}$ and a collection of indexed
documents $T = \{t_1, t_2, \ldots, t_n\}$, the extraction of associations satisfying given
support and confidence constraints $\sigma$ and $\gamma$ is performed:

• by first generating all the key-word sets with support at least equal to $\sigma$
(i.e. all the key-word sets $W$ such that $|[W]| \geq \sigma$). The generated key-word
sets are called the frequent sets (or $\sigma$-covers);

• then by generating all the association rules that can be derived from the
produced frequent sets and that satisfy the confidence constraint $\gamma$.



(a) Generating the frequent sets

The set of candidate $\sigma$-covers (frequent sets) is built incrementally, by starting
from singleton $\sigma$-covers and progressively adding elements to a $\sigma$-cover as long
as it satisfies the support constraint.

The frequent set generation is the most computationally expensive step
(exponential in the worst case). Heuristic and incremental approaches are
currently being investigated.

A basic algorithm for generating frequent sets is indicated in Algorithm 1.

    i = 1; Cand_1 = { {w} | w is a key-word };
    while (Cand_i ≠ ∅) do
        Cand_{i+1} = { S_1 ∪ S_2 | S_1, S_2 ∈ Cand_i,
                       and |S_1 ∪ S_2| = i + 1
                       and ∀S ⊂ S_1 ∪ S_2, (|S| = i) ⇒ (S ∈ Cand_i)
                       and |[S_1 ∪ S_2]| ≥ σ };
        i = i + 1;
    endw

Algorithm 1: Generating the frequent sets
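
A level-wise generation of this kind can be sketched as follows. This Python version is our own minimal rendering, not the KDT implementation: it assumes the collection is a dictionary from document identifiers to key-word sets, it filters the singletons by support from the start, and the sample collection is invented.

```python
from itertools import combinations


def frequent_sets(docs, min_support):
    """All key-word sets W whose covering set has at least min_support documents."""
    def cover_size(W):
        return sum(1 for index in docs.values() if W <= index)

    keywords = set().union(*docs.values())
    # Level 1: singletons that already satisfy the support constraint.
    level = {frozenset([w]) for w in keywords
             if cover_size(frozenset([w])) >= min_support}
    frequent = set(level)
    i = 1
    while level:
        next_level = set()
        for s1, s2 in combinations(level, 2):
            cand = s1 | s2
            if len(cand) != i + 1:
                continue
            # Every subset of size i must itself be frequent.
            if not all(frozenset(sub) in level
                       for sub in combinations(cand, i)):
                continue
            if cover_size(cand) >= min_support:
                next_level.add(cand)
        frequent |= next_level
        level = next_level
        i += 1
    return frequent


# Hypothetical indexed collection (document id -> key-word set).
docs = {
    "t1": {"gold", "silver", "usa"},
    "t2": {"gold", "silver", "usa"},
    "t3": {"gold", "silver", "canada"},
    "t4": {"gold", "copper", "canada"},
}

result = frequent_sets(docs, min_support=2)
for W in sorted(result, key=lambda s: (len(s), sorted(s))):
    print(sorted(W))
```

The subset test prunes candidates before their covering set is counted, which is where most of the cost of the step lies.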



(b) Generating the associations 

Once the maximal frequent sets have been produced, the generation of the 
associations is quite easy. A basic algorithm is presented in Algorithm 2. 



    foreach W maximal frequent set do
        generate all the rules W\{w} ⇒ {w}, where w ∈ W, such that
            |[W]| / |[W\{w}]| ≥ γ
    endfch

Algorithm 2: Generating the associations
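
The rule generation step can be sketched in the same style. The Python fragment below is an illustration under our own assumptions: the caller supplies the frequent sets to expand (for instance the maximal ones), the collection is a dictionary from document identifiers to key-word sets, and the data are invented.

```python
def association_rules(frequent, docs, min_confidence):
    """Rules (W minus {w}) => w derived from each supplied frequent set W."""
    def cover_size(W):
        return sum(1 for index in docs.values() if W <= index)

    rules = []
    for W in frequent:
        if len(W) < 2:
            continue                      # no left-hand side to speak of
        sup = cover_size(W)
        for w in W:
            lhs = W - {w}
            lhs_cover = cover_size(lhs)
            if lhs_cover == 0:
                continue                  # confidence undefined
            conf = sup / lhs_cover
            if conf >= min_confidence:
                rules.append((lhs, w, sup, conf))
    return rules


# Hypothetical indexed collection and one frequent set to expand.
docs = {
    "t1": {"gold", "silver", "usa"},
    "t2": {"gold", "silver", "usa"},
    "t3": {"gold", "silver", "canada"},
    "t4": {"gold", "copper", "canada"},
}
frequent = [frozenset({"gold", "silver", "usa"})]

rules = association_rules(frequent, docs, min_confidence=0.7)
for lhs, w, sup, conf in rules:
    print(sorted(lhs), "=>", w, f"{sup}/{conf:.3f}")
```

Here the rule with right-hand side usa is rejected because its confidence (2/3) falls below the threshold chosen for the example.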



(c) Examples 

Concrete examples of association rules found by KDT on the Reuter Corpus
are provided in Table 2. These associations were extracted with respect to
specific queries expressed by potential users.

query :  ‘find all associations between a set of countries including Iran and any person’
result : [Iran, Nicaragua, USA] => Reagan   6/1.000

query :  ‘find all associations between a set of topics including Gold and any country’
result : [gold, copper] => Canada   5/0.556
         [gold, silver] => USA      18/0.692

Table 2 Examples of associations found by KDT



3.3 NLP techniques for association extraction: Automated 
Indexing 

In the case of the Reuter Corpus, document indexing has been done manually,
but, as manual indexing is a very time-consuming task, it is not realistic
to assume that such processing could systematically be performed in the
general case. Automated indexing of the textual document base, performed
for example in a preprocessing phase, has to be considered in order to allow
the use of association extraction techniques on a large scale.

Techniques for automated production of indexes associated with documents
can be borrowed from the Information Retrieval field. In this case, they usually
rely on frequency-based weighting schemes (Salton and Buckley 1988). Several
examples of such weighting schemes are provided in the SMART Information
Retrieval system. Formula (1) presents the SMART atc weighting scheme.

$$w_{ij} = \begin{cases} 0.5 \times \left(1 + \dfrac{p_{ij}}{\max_k p_{ik}}\right) \log\left(\dfrac{N}{n_j}\right) & \text{if } p_{ij} \neq 0 \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

where $w_{ij}$ is the weight of word $w_j$ in document $t_i$, $p_{ij}$ is the relative
document frequency of $w_j$ in $t_i$ ($p_{ij} = f_{ij} / |t_i|$, where $f_{ij}$ is the number
of occurrences of $w_j$ in $t_i$), $N$ is the number of documents in the collection
and $n_j$ is the number of documents containing $w_j$.

Once a weighting scheme has been selected, automated indexing can be 
performed by simply selecting, for each document, the words satisfying given 
weight constraints. 
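
A frequency-based indexing step of this kind can be sketched as follows. The Python code is an illustration only: it assumes a weight of the form 0.5 × (1 + p / max p) × log(N / n), with p the relative frequency of the word in the document, it leaves out any length normalization, and the documents and the threshold are invented.

```python
import math
from collections import Counter


def frequency_weights(docs):
    """docs: {doc_id: list of words}. Returns {doc_id: {word: weight}}."""
    N = len(docs)
    # n[w]: number of documents containing the word w.
    n = Counter(word for words in docs.values() for word in set(words))
    weights = {}
    for doc_id, words in docs.items():
        counts = Counter(words)
        if not counts:
            weights[doc_id] = {}
            continue
        # p[w]: occurrences of w divided by the document length.
        p = {w: c / len(words) for w, c in counts.items()}
        p_max = max(p.values())
        weights[doc_id] = {
            w: 0.5 * (1 + p[w] / p_max) * math.log(N / n[w]) for w in p
        }
    return weights


def index_terms(weights, threshold):
    """Keep, for each document, the words whose weight reaches the threshold."""
    return {doc_id: {w for w, x in ws.items() if x >= threshold}
            for doc_id, ws in weights.items()}


# Hypothetical collection of three very short documents.
docs = {
    "t1": ["gold", "gold", "price"],
    "t2": ["price", "rise"],
    "t3": ["price", "gold"],
}

weights = frequency_weights(docs)
index = index_terms(weights, threshold=0.3)
print(index)
```

A word present in every document receives a zero weight because log(N / n) vanishes, so it is never selected as an index term.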



The major advantage of automated indexing procedures is that they
drastically reduce the cost of the indexing step. One of their main drawbacks is
however that, when applied without additional knowledge (such as a
thesaurus), they produce indexes with extremely reduced generalization power
(key-words have to be explicitly present in the documents, and do not always
provide a good thematic description).



3.4 Additional issues 

(a) Integration of background knowledge 

If background knowledge is available (for example some factual knowledge
about the application domain), additional constraints can be integrated in
the association generation procedure (either in the frequent set generation, or
directly in the association extraction). An example of a system using
background knowledge for association generation is the FACT system developed
by Feldman and Hirsh (1996).

(b) Generalization of the notion of association 

Several generalizations are possible for the notion of association (rule):

• rules with more than one key-word in their right-hand side, which can
express more complex implications;

• more general attributes (i.e. not only restricted to key-word
presence/absence): discrete and continuous variables;

• non-implicative relations, such as pseudo-equivalences;

• different quality measures providing alternative approaches for confidence
evaluation.

An example of a system integrating such kinds of generalizations is the GUHA
system developed at the Institute of Computer and Information Science in
Prague.



4 PROTOTYPICAL DOCUMENT EXTRACTION FROM FULL 
TEXT 

The association extraction presented in the previous section exclusively oper- 
ates on the document indexes, and therefore does not directly take advantage 
of the textual content of the documents. Approaches based on full text mining 
for information extraction can then be considered. 

Our initial experiments on the Reuter Corpus (Rajman and Besançon 1997)
were dedicated to the implementation and evaluation of association extraction
techniques operating on all the words contained in the documents instead of
only considering the associated key-words. The results showed, however,
that association extraction based on full text documents does not provide
effectively exploitable results. Indeed, the association extraction process
either just detected compounds, i.e. domain-dependent terms such as
[wall] => street or [treasury secretary james] => baker, which cannot be
considered as ‘potentially useful’ (referring to the KDD definition given in
section 2), or extracted uninterpretable associations such as [dollars shares
exchange total commission stake] => securities, which could not be considered
as ‘ultimately understandable’.

We therefore had to look for a new Text Mining task that would be better
suited to full text information extraction from large collections of textual data.
We decided to concentrate on the extraction of prototypical documents,
where ‘prototypical’ is informally defined as corresponding to information
that occurs in a repetitive fashion in the document collection. The underlying
working hypothesis is that repetitive document structures provide significant
information about the textual base that is processed.

Basically, the method presented in this section relies on the identification of 
frequent sequences of terms in the documents, and uses NLP techniques such 
as automated Part-of-Speech Tagging and Term Extraction to preprocess the 
textual data. 

The NLP techniques can be considered as an automated generalized in- 
dexing procedure that extracts from the full textual content of the documents 
linguistically significant structures that will constitute a new basis for frequent 
set extraction. 



4.1 NLP Preprocessing for prototypical document 
extraction 

(a) Part-Of-Speech tagging 

The objective of Part-Of-Speech tagging (POS-tagging) is to automatically
assign Part-of-Speech tags (i.e. morpho-syntactic categories such as
noun, verb, adjective, ...) to words in context. For instance, a sentence such as
‘a computational process executes programs’ should be tagged as ‘a/DET
computational/ADJ process/N executes/V programs/N’. The main difficulty of
such a task lies in the lexical ambiguities that exist in all natural languages.
For instance, in the previous sentence, both words ‘process’ and ‘programs’
could be either nouns (N) or verbs (V).

Several techniques have been designed for POS-tagging:

• Hidden Markov Model based approaches (Cutting et al. 1992);

• Rule-based approaches (Brill 1992).

If a large lexicon (providing good coverage of the application domain) and
some hand-tagged text are available, such methods perform automated
POS-tagging in a computationally very efficient way (linear complexity)
and with very satisfactory performance (on average, 95-98% accuracy).

One of the important advantages of POS-tagging is that it allows automated
filtering of non-significant words on the basis of their morpho-syntactic
category. For instance, in our experiments (where we used E. Brill’s rule-based
tagger (Brill 1992)), we decided to filter out articles, prepositions,
conjunctions, ... therefore restricting the effective mining process to nouns,
adjectives, and verbs.
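
Filtering on the morpho-syntactic category is straightforward once the text is tagged. The short Python sketch below assumes the ‘word/TAG’ notation used in the example sentence above and the tag names N, ADJ and V; it does not call any tagger and is not the code used in our experiments.

```python
KEPT_TAGS = {"N", "ADJ", "V"}   # nouns, adjectives and verbs


def filter_by_pos(tagged_text, kept_tags=KEPT_TAGS):
    """Return the words of a 'word/TAG ...' string whose tag is in kept_tags."""
    kept = []
    for token in tagged_text.split():
        # Split on the last slash so that words containing '/' stay intact.
        word, sep, tag = token.rpartition("/")
        if sep and tag in kept_tags:
            kept.append(word)
    return kept


sentence = "a/DET computational/ADJ process/N executes/V programs/N"
print(filter_by_pos(sentence))
```

For the example sentence, only the determiner is dropped.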



(b) Term extraction

In order to automatically detect domain-dependent compounds, a term ex- 
traction procedure has been integrated in the preprocessing step. 

Automated term extraction is indeed one of the critical NLP tasks for vari- 
ous applications (such as terminology extraction, enhanced indexing...) in the 
domain of textual data analysis. 

Term extraction methods are often decomposed into two distinct steps 
(Daille 1994): 



• extraction of term candidates on the basis of structural linguistic informa- 
tion; for example, term candidates can be selected on the basis of relevant 
morpho-syntactic patterns (such as ‘N Prep N’: board of directors, Secre- 
tary of State,...; ‘Adj N’: White House, annual rate,...; etc); 

• filtering of the term candidates on the basis of some statistical relevance 
scoring schemes, such as frequency, mutual information, coefficient, log- 
like coefficient,...; in fact, the actual filters often consist of combinations 
of different scoring schemes associated with experimentally defined thresh- 
olds. 



In our experiments, we used 4 morpho-syntactic patterns to extract the
term candidates: ‘Noun Noun’ (1), ‘Noun of Noun’ (2), ‘Adj Noun’ (3), ‘Adj
Verbal’ (4). In order to extract more complex compounds such as ‘Secretary
of State George Shultz’, the term candidate extraction was applied in an
iterative way where terms identified at step n were used as atomic elements
for step n + 1 until no new terms were detected. For example, the sequence
‘Secretary/N of/prep State/N George/N Shultz/N’ was first transformed into
‘Secretary-of-State/N George-Shultz/N’ (patterns 2 and 1) and then
combined into a unique term ‘Secretary-of-State-George-Shultz/N’ (pattern 1).
A purely frequency-based scoring scheme was then used for filtering.
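
The iterative application of the patterns can be sketched as repeated merging passes. The Python code below is a simplified illustration and not the prototype used in our experiments: it handles only patterns 1 to 3 (‘Adj Verbal’ is left out because its tag is not specified here), it assumes the tags N and ADJ, and it applies no frequency filter.

```python
def merge_pass(tokens):
    """One left-to-right pass over (word, tag) pairs; returns (tokens, changed)."""
    out, i, changed = [], 0, False
    while i < len(tokens):
        word, tag = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        nxt2 = tokens[i + 2] if i + 2 < len(tokens) else None
        # Pattern 2: Noun of Noun
        if (tag == "N" and nxt and nxt[0].lower() == "of"
                and nxt2 and nxt2[1] == "N"):
            out.append((f"{word}-of-{nxt2[0]}", "N"))
            i += 3
            changed = True
        # Patterns 1 and 3: Noun Noun, Adj Noun
        elif tag in ("N", "ADJ") and nxt and nxt[1] == "N":
            out.append((f"{word}-{nxt[0]}", "N"))
            i += 2
            changed = True
        else:
            out.append(tokens[i])
            i += 1
    return out, changed


def extract_terms(tokens):
    """Repeat the merging pass until no new term is produced."""
    changed = True
    while changed:
        tokens, changed = merge_pass(tokens)
    return tokens


tokens = [("Secretary", "N"), ("of", "prep"), ("State", "N"),
          ("George", "N"), ("Shultz", "N")]
print(merge_pass(tokens)[0])
print(extract_terms(tokens))
```

Each pass that changes anything shortens the token list, so the iteration always terminates.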

The prototype integrating POS-tagging and term extraction that we used
for our experiments was designed in collaboration with R. Feldman’s team at
Bar Ilan University.



4.2 Mining for prototypical documents

(a) The extraction process

The extraction process can be decomposed into four steps: 

• NLP preprocessing: POS-tagging and term extraction, as described in the 
previous section; 

• frequent term sets generation using an algorithm globally similar to the one 
described in Algorithm 1 (with some minor changes, particularly concerning 
the data representation); 

• clustering of the term sets based on a similarity measure derived from the 
number of common terms in the sets; 

• actual production of the prototypical documents associated with the ob- 
tained clusters. 

The whole process is described in more detail in subsection (b), on the basis 
of a concrete example. 

As we already mentioned earlier, association extraction from full text docu- 
ments provided uninterpretable results, indicating that associations constitute 
an inadequate representation for the frequent sets in the case of full text min- 
ing. In this sense, the prototypical documents are meant to correspond to 
more operational structures, giving a better representation of the repetitive 
documents in the text collection and therefore providing a potentially useful 
basis for a partial synthesis of the information content hidden in the textual 
base. 



(b) Example

Figure 1 presents an example of a (SGML tagged) document from the Reuter 
Corpus. 

Figure 2 presents, for the same document, the result of the NLP preprocess- 
ing step (POS-tagging and term extraction: the extracted terms are printed 
in boldface). 

During the production of term sets associated with the documents, filtering 
of non-significant terms is performed, on the basis of: 

• morpho-syntactic information: we only keep nouns, verbs and adjectives; 

• frequency criteria: we only keep terms with frequency greater than a given 
minimal support; 

• empirical knowledge: we remove some frequent but non-significant verbs (is,
has, been, ...).



After this treatment, the following indexing structure (term set) is obtained 
for the document and will serve as a basis for the frequent set generation: 
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<REUTERS NEWID=”2088”> 

(...) 

<BODY>Nissan Motor Co Ltd <NSAN.T> is issuing a 35 billion yen eurobond 
due March 25 1992 paying 5-1/8 pct and priced at 103-3/8, Nikko Securities Co 
(Europe) Ltd said. 

The non-callable issue is available in denominations of one mln Yen and will be 
listed in Luxembourg. 

The payment date is March 25. 

The selling concession is 1-1/4 pct while management and underwriting combined 
pays 5/8 pct. 

Nikko said it was still completing the syndicate. </BODY></TEXT> 

</REUTERS> 



Figure 1 An example of Reuter Document 



<DOC2088> 

Nissan_Motor_Co_Ltd/N “/“ NSAN/N ./. T/N ”/” is/V issuing/V a/DET 
35_billion_yen/CD eurobond/V due/ADJ March_25/CD 1992/CD paying/V 
5-1/8_percent/CD and/CC priced/V at/PR 103-3/8/ADJ ,/, 
Nikko_Securities_Co/N (/( Europe/N )/SYM Ltd/N said/V ./. 

The/DET non-callable_issue/N is/V available/ADJ in/PR denominations/N 
of/PR one_million/CD Yen/CD and/CC will/MD be/V listed/V in/PR 
Luxembourg/N ./. 

The/DET payment_date/N is/V March_25/CD ./. 

The/DET selling_concession/N is/V 1-1/4_percent/CD while/PR 
management/N and/CC underwriting/N combined/V pays/V 5/8_percent/CD ./. 
Nikko/N said/V it/PRP was/V still/RB completing/V the/DET syndicate/N ./. 



Figure 2 A tagged Reuter Document 



{available/adj combined/v denominations/n due/adj europe/n issuing/v listed/v 
luxembourg/n management/n paying/v payment_date/n pays/v priced/v 
selling_concession/n syndicate/n underwriting/n} 

The frequent sets generation step (of course operating on the whole doc- 
ument collection) then produces, among others, the following frequent term 
sets (POS-tags have been removed to increase readability): 



{due available management priced issuing paying denominations underwriting} 86 

{due available management priced issuing denominations payment_date} 87 

{due available management priced issuing denominations underwriting luxembourg} 81 

{due management selling priced issuing listed} 81 

{due priced issuing combined denominations payment_date} 80 

{management issuing combined underwriting payment_date} 80 

(...) 

where the numeric values correspond to the frequency of the sets in the 
collection. 
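
As an illustration of how such frequent term sets and their frequencies can be obtained, the following is a generic level-wise (Apriori-style) sketch. It is not the paper's Algorithm 1, whose data representation differs; the function name and the inclusive support threshold are assumptions.

```python
from itertools import combinations


def frequent_term_sets(term_sets, min_support):
    """Level-wise (Apriori-style) search for frequent term sets.

    term_sets: list of sets of terms, one per document.
    Returns a dict mapping each frequent frozenset of terms to the number
    of documents that contain it.
    """
    def support(candidate):
        return sum(1 for terms in term_sets if candidate <= terms)

    # Level 1: frequent single terms.
    items = {t for terms in term_sets for t in terms}
    level = {}
    for t in items:
        s = support(frozenset([t]))
        if s >= min_support:
            level[frozenset([t])] = s
    frequent = dict(level)

    k = 2
    while level:
        # Join step: merge frequent (k-1)-sets whose union has k terms.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == k}
        next_level = {}
        for cand in candidates:
            # Prune step: every (k-1)-subset must itself be frequent.
            if all(frozenset(sub) in level
                   for sub in combinations(cand, k - 1)):
                s = support(cand)
                if s >= min_support:
                    next_level[cand] = s
        frequent.update(next_level)
        level = next_level
        k += 1
    return frequent
```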

In order to reduce the important information redundancy due to partial 
overlapping between the sets, clustering was performed to gather some of the 
term sets into classes (clusters), represented by the union of the sets: 
{due available management priced issuing combined denominations listed underwriting 
luxembourg payment_date paying} 45 
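
The clustering step can be sketched as below. The text only states that the similarity is derived from the number of common terms; the Jaccard coefficient, the greedy single-pass strategy and the threshold parameter used here are assumptions made for illustration. Term sets are assumed to be non-empty.

```python
def overlap(a, b):
    """Similarity from the number of common terms (Jaccard coefficient)."""
    return len(a & b) / len(a | b)


def cluster_term_sets(term_sets, threshold):
    """Greedy single-pass clustering; each cluster is the union of its sets."""
    clusters = []
    for terms in term_sets:
        terms = set(terms)
        for cluster in clusters:
            if overlap(terms, cluster) >= threshold:
                cluster |= terms  # represent the class by the union
                break
        else:
            # No sufficiently similar cluster: start a new one.
            clusters.append(terms)
    return clusters
```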

To reduce the possible meaning shifts linked to non-corresponding word 
sequences, the term sets representing identified clusters were split into sets of 
distinct term sequences associated with paragraph boundaries in the original 
documents. The most frequent sequential decompositions of the clusters are 
then computed and some of the corresponding document excerpts extracted. 
These document excerpts are by definition the prototypical documents 
corresponding to the output of the mining process. 
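
A minimal sketch of the sequential decomposition, assuming each document is available as a list of paragraphs and each paragraph as a list of terms in reading order (both the representation and the function names are hypothetical):

```python
from collections import Counter


def decompose(cluster, paragraphs):
    """Split a cluster's terms into per-paragraph groups, in document order.

    paragraphs: list of paragraphs, each a list of terms in reading order.
    """
    groups = []
    for paragraph in paragraphs:
        # Keep cluster terms in order of first appearance in the paragraph.
        seen = []
        for term in paragraph:
            if term in cluster and term not in seen:
                seen.append(term)
        if seen:
            groups.append(tuple(seen))
    return tuple(groups)


def most_frequent_decomposition(cluster, documents):
    """Return the most common decomposition over documents and its count."""
    counts = Counter(decompose(cluster, doc) for doc in documents)
    counts.pop((), None)  # ignore documents sharing no term with the cluster
    return counts.most_common(1)[0] if counts else None
```

The documents that realise the winning decomposition are then the candidates from which prototypical excerpts would be drawn.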

Figure 3 presents both the most frequent sequential decomposition for the 
previous set and the associated prototypical document. 



(issuing due paying priced) (available denominations listed luxembourg) 
(payment_date) (management underwriting combined) 41 

<DOC2088> 

Nissan_Motor_Co_Ltd “NSAN.T” is issuing a 35_billion_yen eurobond due 
March_25 1992 paying 5-1/8_percent and priced at 103-3/8, Nikko_Securities_Co 
( Europe ) Ltd said. 

The non-callable_issue is available in denominations of one_million Yen and will 
be listed in Luxembourg. 

The payment_date is March_25. 

The selling_concession is 1-1/4_percent while management and underwriting 
combined pays 5/8_percent. 

Nikko said it was still completing the syndicate. 



Figure 3 An example of prototypical document 



4.3 Future Work 

(a) Name entity tagging 

We performed syntactic Part-Of-Speech tagging on the document base. Sim- 
ilar techniques can also be used for semantic tagging. 

For instance, the Alembic environment, developed by the MITRE Natural 
Language Processing Group (MITRE NLP Group 1997), corresponds to a set 
of techniques allowing rule-based name entity tagging. The rules used by the 
system have been automatically learned from examples. 

Figure 4 presents the prototypical document given in the previous section, 
as tagged by Alembic. This tagging has been provided by Christopher Clifton, 
from MITRE, and used two rule bases, trained to respectively recognize per- 
son/location/organization, and date/time/money/numbers. 

This kind of semantic tagging will undoubtedly be useful for the generalization 
of the variable parts in prototypical documents, and could be considered 
<s> <ENAMEX TYPE=ORGANIZATION>Nissan Motor Co Ltd</ENAMEX> 
”<ENAMEX TYPE=ORGANIZATION>NSAN</ENAMEX>.</s><s>T” is issuing 
a <NUMBER>35</NUMBER> <NUMBER>billion</NUMBER> yen 
eurobond due <TIMEX TYPE=DATE>March 25 1992</TIMEX> paying 
<NUMBER>5</NUMBER>-<NUMBER>1/8</NUMBER> percent and priced 
at <NUMBER>103</NUMBER>-<NUMBER>3/8</NUMBER> , <ENAMEX 
TYPE=ORGANIZATION>Nikko Securities Co</ENAMEX> ( <ENAMEX 
TYPE=LOCATION>Europe</ENAMEX> ) Ltd said.</s> 

<s>The non-callable issue is available in denominations of 
<NUMBER>one</NUMBER> <NUMBER>million</NUMBER> <ENAMEX 
TYPE=ORGANIZATION>Yen</ENAMEX> and will be listed in <ENAMEX 
TYPE=LOCATION>Luxembourg</ENAMEX>.</s> 

<s>The payment date is <TIMEX TYPE=DATE>March 25</TIMEX>.</s> 
<s>The selling concession is <NUMBER>1</NUMBER>- 
<NUMBER>1/4</NUMBER> percent while management and underwriting 
combined pays <NUMBER>5/8</NUMBER> percent. </s> 

<s><ENAMEX TYPE=ORGANIZATION>Nikko</ENAMEX> said it was still 
completing the syndicate. </s> 



Figure 4 Name entity tagging of a Prototypical Document 



as an abstraction process that will provide a better representation of the 
synthetic information extracted from the base. 

(b) Implicit user modeling 

In any information extraction process, it is of great interest to take 
interaction with the user into account. Experiments in Information Retrieval 
(IR) have shown, for instance, that better relevance results can be obtained 
by using relevance feedback techniques (techniques that integrate the user's 
relevance evaluation of the retrieved documents). 

In our model, such an approach could integrate both a posteriori 
and a priori information about the user, and would therefore amount to the 
integration of an implicit model of the user. 

• A posteriori information could be obtained, with a procedure similar to 
that of classical IR processes, through the analysis of the reactions of the 
user concerning the results provided by the TM system (relevance or 
usefulness of extracted prototypical documents). 

• A priori information could be derived, for example, from any 
pre-classification of the data (often present in real data: for example, users 
often classify their files in directories or folders). This user pre-partitioning 
of the document base contains interesting information about the user and 
could serve as a basis for deriving more adequate parameters for the similarity 
measures (for instance, the parameters could be tuned in order to minimize 
inter-class similarity, and maximize intra-class similarity). 
5 CONCLUSION 

The general goal of Data Mining is to automatically extract information from 
databases. Text Mining corresponds to the same global task but specifically 
applied to unstructured textual data. In this paper, we have presented two 
different TM tasks: association extraction from a collection of indexed docu- 
ments, designed to answer specific queries expressed by the users, and proto- 
typical document extraction from a collection of full-text documents, designed 
to automatically find information about classes of repetitive document struc- 
tures that could be used for automated synthesis of the information content 
of the textual base. 
Abstract 

The BRAINNE method is a technique for extracting symbolic production rules, 
concepts and concept hierarchies from neural networks. Previous work reported on 
the extraction of conjunctive and disjunctive rules for the case of binary, discrete 
and continuous features. In this paper we explain three new improvements. The 
first improvement uses a hybrid two layer network for learning disjunctive rules, 
where the first layer consists of a network that uses unsupervised learning and the 
second layer uses supervised learning. The second improvement examines the 
effect on the generalisation capability of the rules developed by avoiding 
overtraining of the neural network, and the third development is an improved 
approach to dealing with continuous features which greatly increases the 
generalisation capability of the derived rules. 
Neural networks, hybrid systems, symbolic knowledge extraction 
1 INTRODUCTION 

Neural networks can be distinguished by the type of learning employed. 
Supervised learning involves data sets with input patterns and their related targets 
or outputs. The corresponding target output values are available to guide the 
learning. A single layer perceptron trained with the delta rule is a supervised 
network. Unsupervised learning aims to detect underlying statistical patterns in the 
input data. For example, the self organising Kohonen network groups into clusters 
the output nodes that are topologically close to each other and that respond in a 
similar manner to the training set. 

One criticism of neural networks for datamining is that the knowledge embedded 
in these subsymbolic systems is not easy for humans to comprehend. It is a 
worthwhile aim to transform the cognitive black box neural network into a 
cognitive clear box, if required. In other words, if the subsymbolic knowledge 
representation of neural networks can be mapped into a useful, understandable 
symbolic form then one of the constant criticisms of neural networks will no longer 
be relevant. 

In the literature, one finds several methods for extracting propositional 
conjunctive rules (Fu, 1991; Gallant, 1993; Saito & Nakano, 1988; Towell & 
Shavlik, 1993). These essentially view rule extraction as a form of breadth first 
search. One difficulty here is that the computational complexity is exponential in 
the number of features. In addition, they primarily identify a conjunctive rule for 
each output. 
An important departure from this is the approach of Craven and Shavlik (1993). 
Their M of N method derives rules in the form: 

IF (M of the following N antecedents are true) THEN .... 

If M = N then only conjunctive rules are found, and if M < N then a number of 
disjunctive rules are formed. They limit their consideration to networks that have 
discrete output classes and input features that have Boolean or nominal values. 
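
The M of N rule form can be illustrated with a short sketch (the function name is hypothetical, and "M of N" is read here as "at least M"):

```python
def m_of_n(antecedents, m):
    """True when at least m of the Boolean antecedents hold."""
    return sum(bool(a) for a in antecedents) >= m
```

With m equal to the number of antecedents the test reduces to a conjunction; with m = 1 it reduces to a plain disjunction.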

Sestito and Dillon (1989, 1990a-c, 1991a-b, 1992, 1993, 1994) detail the research 
that has led to the BRAINNE (Building Representations for AI using Neural 
NEtworks) method. For supervised networks, BRAINNE can successfully extract 
production rules from domains with both symbolic discrete attributes and 
continuous numeric attributes. A method of extracting rules, concepts and 
hierarchies has also been developed for data where no target outputs are given 
using unsupervised BRAINNE (Dillon et al., 1993, 1994; Sestito & Dillon, 1994). 

There are two types of production rules which may be extracted: conjunctive and 
disjunctive. When the antecedent X consists of premises $x_1, x_2, \dots, x_n$ 
joined by the $\wedge$ (AND) connective, the rule is conjunctive: 

$x_1 \wedge x_2 \wedge \dots \wedge x_n \Rightarrow C$. 

Such rules require all premises to hold for the conclusion to be true. 

Disjunctive rules arise when more than one set of conditions lead to the same 
conclusion. The disjunctive rule 
$x_1 \vee x_2 \vee \dots \vee x_n \Rightarrow C$, 

comprising groups of premises joined by the $\vee$ (OR) connective, is satisfied 
when at least one premise is true. Note that the premises in a disjunctive rule may 
themselves contain several conjunctive rules. 

Domains requiring disjunctive rules have two or more different sets of defining 
attributes for the one concept. The underlying strategy in Supervised BRAINNE is 
to divide the examples in the training set into subsets or partitions that can be 
uniquely described by one conjunctive rule. This approach requires retraining of a 
modified supervised learning network at each partition. This paper presents work 
on extracting disjunctive rules by a method which, unlike supervised BRAINNE, 
does not require resolving several modified networks. Both conjunctive and 
disjunctive rules are directly obtained from Hybrid BRAINNE. 

BRAINNE is designed to deal with both discrete and continuous data. For 
continuous data, a dynamically chosen threshold value is used to produce bounds 
for each of the attributes in a conjunctive rule. This paper also examines several 
improvements to BRAINNE to increase its generalisation capability by: 

• using a test set to stop the network from overtraining i.e. memorising; 

• defining improved bounds for continuous data. 

2 SUPERVISED BRAINNE 

2.1 Brief overview of BRAINNE 

In this paper, we present a brief overview of the BRAINNE approach for extracting 
rules and concepts. As a precursor to BRAINNE, the extraction of production 
rules, higher level concepts and a concept hierarchy from a fully connected, single 
layer network trained with Hebb's Rule was investigated. Sestito and Dillon (1994) 
explain that the weights in the weight matrix indicate the contribution each input 
makes to each output. Thus, the input with the largest weight link to an output 
makes the largest contribution to that output. This forms the basis for extracting a 
conjunctive rule for each output using the contributory inputs or attributes in the 
antecedent. Higher level concepts or specific combinations of defining attributes 
can be extracted from the initial set of rules. A concept hierarchy can be formed, if 
possible, with higher lying concepts representing generalisations and lower lying 
concepts representing specialisations (Sestito & Dillon, 1994). The method 
requires the repetition of the following steps until all rules are reduced to 
tautologies (i.e. the concepts imply themselves) (Sestito & Dillon, 1994): 

• Select the rules with the least number of contributory attributes. 

• Insert these concepts into the rule base by replacing all combinations of the 
attributes that define a concept by the concept itself, in any of the remaining 
rules. 
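
The two steps above can be sketched as follows. This is a simplified illustration: rules are represented as a mapping from a concept name to its set of defining attributes, ties are broken alphabetically, and the loop stops once every rule has been selected rather than rewriting rules into explicit tautologies. All names are hypothetical.

```python
def build_hierarchy(rules):
    """rules: dict mapping concept name -> set of defining attributes.

    Returns the rewritten rules, in which attribute groups that define a
    simpler concept are replaced by that concept's name.
    """
    rewritten = {c: set(attrs) for c, attrs in rules.items()}
    pending = set(rewritten)
    while pending:
        # Select the pending rules with the fewest contributory attributes.
        smallest = min(len(rewritten[c]) for c in pending)
        selected = sorted(c for c in pending if len(rewritten[c]) == smallest)
        remaining = sorted(pending - set(selected))
        for concept in selected:
            definition = rewritten[concept]
            for other in remaining:
                if definition <= rewritten[other]:
                    # Replace the defining attributes by the concept itself.
                    rewritten[other] = (rewritten[other] - definition) | {concept}
        pending -= set(selected)
    return rewritten
```

A rule whose antecedent now mentions another concept sits below that concept in the hierarchy.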
This rule extraction approach has some serious limitations. The modelling capacity 
of single layer neural networks does not extend to complex, nonlinear domains, 
and the rules produced are conjunctive (Sestito & Dillon, 1994). Domains exist 
which have two or more different sets of defining attributes for the one concept, 
requiring a disjunctive rule. 

2.2 Supervised BRAINNE and conjunctive rules 

To extract conjunctive rules for each output, BRAINNE uses a multi-layered 
neural net to determine the relevance of inputs and a single layer neural net to 
determine the irrelevance of inputs to a particular output. A summary of the steps 
in Sestito and Dillon (1990a-c, 1994) to extract conjunctive rules follows. 

• Train a modified, multi-layered neural network with one hidden layer using 
Back-propagation. The modified net has an extended input set consisting of 
the original inputs and the desired outputs as additional inputs. 

• Determine the contribution of an original input to an output using the Sum of 
Squares Error (SSE) measurement: 

$SSE_{ab} = \sum_j (W_{bj} - W_{aj})^2$ 

where a is an input attribute, b is an additional input (desired output), and 
$W_{aj}$ and $W_{bj}$ are the weight links between a and hidden unit j, and b and 
hidden unit j, respectively. The smaller the SSE value, the greater the 
contribution of the original attribute to the output. This is based on the 
premise that there is a strong association between an input and an additional 
input (desired output) if their weight links to each hidden unit are similar. 

• Negate the original inputs and use these as inputs to a single layer neural 
network with the desired outputs as its outputs. Train the net using Hebb’s 
Rule to obtain a measure of irrelevance of the original inputs to the outputs. 
The smaller the weight value, the greater the contribution of the original 
attribute to the output. 

• Calculate the product of the inhibitory weight values and the SSE 
measurements for all combinations of the inputs a and the outputs b. The 
smaller the product value, the greater the contribution of the original attribute 
to the output. 

• For each output, sort the product list from maximum to minimum. Select 
those input attributes below some clear cut-off point in the product list as the 
contributory attributes in the antecedent of the corresponding conjunctive 
rule. A clear cut-off occurs if one of two consecutive products is 2 or 3 times 
larger than the other. 

• From this initial set of rules, extract a concept hierarchy, if required. 
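
The SSE measurement, the product with the inhibitory weights and the cut-off test can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: products are assumed non-negative, the `ratio` parameter stands for the "2 or 3 times" factor, and when several gaps qualify the sketch takes the first one met when scanning from the largest product.

```python
def sse(w_a, w_b):
    """Sum of squared differences between two hidden-layer weight vectors."""
    return sum((wb - wa) ** 2 for wa, wb in zip(w_a, w_b))


def products_for_output(hidden_w, out_name, inhibitory_w):
    """Product of inhibitory weight and SSE for every original attribute.

    hidden_w: dict input name -> list of weights to the hidden units
              (must also contain the additional input out_name).
    inhibitory_w: dict attribute -> Hebbian weight from the negated input.
    """
    return {a: inhibitory_w[a] * sse(hidden_w[a], hidden_w[out_name])
            for a in inhibitory_w}


def select_attributes(products, ratio=2.0):
    """Attributes below the first clear cut-off, or None if there is none."""
    ranked = sorted(products.items(), key=lambda kv: kv[1], reverse=True)
    for i in range(len(ranked) - 1):
        larger, smaller = ranked[i][1], ranked[i + 1][1]
        if larger > smaller and larger >= ratio * smaller:
            return [name for name, _ in ranked[i + 1:]]
    return None
```

A `None` result corresponds to the situation in which no conjunctive rule can be read off directly for that output.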
2.3 Supervised BRAINNE and disjunctive rules 

Two difficulties can be encountered in the process described in Section 2.2. The 
constructed conjunctive rule may not be unique, or the sorted product list may have 
no clear cut-off point. In both these cases, the procedure for extracting disjunctive 
rules is required for that output. 

The underlying strategy is to divide the examples in the training set of the 
particular output into subsets that can be uniquely described by one conjunctive 
rule. BRAINNE initially splits the examples of the output into subsets using the 
attribute with the smallest nonzero product value. Note that any attribute with a 
zero product value is automatically selected as well. Two initial rules are formed. 
For example, if the sorted products list for OUTPUT1 has attribute a1 with a zero 
value and attribute a2 with a value of 0.1 as the next largest attribute, then a2 is 
selected as the splitting attribute and a1 is automatically selected. Two subsets are 
formed for the output and the two corresponding initial rules are: 

a1 AND a2 => (OUTPUT1) 
a1 AND NOT a2 => (OUTPUT1). 

If an initial rule covers no examples it is discarded. Also a check is made to see if 
an initial rule is stopped from further expansion. 

A new modified neural net is constructed with the initial rules as the only 
outputs. The input set consists of the original inputs and the initial rules as 
additional inputs. When a rule is used as an additional input in BRAINNE, the 
input is set to 1 if the example being applied to the net is covered by the rule. 
Otherwise it is set to 0. Sestito and Dillon (1991b, 1994) describe a cyclic guided 
generate-and-test procedure to extract the disjunctive rule for an output. Note in 
the above, after each partition a new modified neural network has to be constructed 
and trained. 

The full detailed algorithm using this approach is given in Sestito and Dillon 
(1991b, 1994). 

2.4 Supervised BRAINNE and continuous data 

The form of the input in the BRAINNE method so far described is restricted to 
categorical or linear discrete data. The production rules determined are defined by 
lists of attributes. This makes the process of testing whether an example is covered 
by a rule straightforward as (Sestito & Dillon, 1994): 

• if all the attributes defining a rule are present in an example, then the 
example is covered by the rule; otherwise it is not covered. 

If continuous data is used in systems that require discrete inputs, the solution often 
adopted is to break the continuous data up into two or more classes or bins by 
selecting an appropriate number of threshold values. Disadvantages of this 
approach include an increase in the number of input attributes and the arbitrary 
nature of the choice of the number of thresholds and their values (Sestito & Dillon, 
1994). One of the positive features of neural networks is their ability to handle 
continuous data directly without any preprocessing. 

BRAINNE uses the normalised values of continuous attributes directly, with one 
original input for each continuous attribute. This requires an approach to 
determining if a rule covers an example. A rule is a list of attributes, whereas a 
continuous attribute in an example is a normalised real number. A method for 
comparing these different representations is needed when determining the coverage 
of a continuous example by a rule (Sestito & Dillon, 1992, 1994). 

When a rule is used as an additional input in BRAINNE, its value is set to 1 if the 
example being applied to the net is covered by the rule. Otherwise it is set to 0. To 
determine coverage for an example with any continuous attributes among its 
features, the following statistical measure calculating the closeness of the example 
to the rule is used (Sestito & Dillon, 1994): 

$\text{SumDifference} = \sum_i \left(\text{R:act:attr}_i - \text{E:act:attr}_i\right)^2$ 

where i ranges over all the attributes in the rule (continuous and discrete), 

$\text{R:act:attr}_i$ is the value of the attribute in the rule, and 
$\text{E:act:attr}_i$ is the value of the attribute in the example. 

If attrC is a continuous attribute that is present in the antecedent of a constructed 
rule in BRAINNE, it is assigned the value 1.0 in the rule. Similarly, if NOT attrC is 
in the antecedent, it is assigned the value 0.0. If the SumDifference is less than or 
equal to a small threshold t, the rule covers the example. During the operation of 
continuous BRAINNE, this threshold is dynamically checked and increased if 
necessary to ensure that the appropriate number of examples is classified. 

The continuous and discrete defining attributes for each rule have been 
determined by the process described earlier but the possible range of each 
continuous attribute in each rule is still required. First, all the examples are tested 
via the SumDifference to obtain the example set covered by the rule. Then the 
range for each continuous attribute in the rule is obtained directly from the 
example set. The minimum MinC and maximum MaxC of attrC, say, over the 
example set determine the bounds of attrC in the corresponding clause in the rule’s 
antecedent: 

(MinC < attrC < MaxC). 

If NOT attrC were in the rule, exactly the same procedure is followed to obtain a 
clause in identical format to the one above, with an appropriate range. 
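
The coverage test and the derivation of bounds can be sketched as follows, assuming that SumDifference sums squared differences and that rules and examples are stored as dictionaries of attribute values (the data layout and function names are hypothetical; the dynamic adjustment of the threshold t is not shown):

```python
def sum_difference(rule, example):
    """Squared distance between a rule and an example over the rule's attributes.

    rule: dict attribute -> 1.0 (attribute present) or 0.0 (NOT attribute).
    example: dict attribute -> normalised value in [0, 1].
    """
    return sum((value - example[attr]) ** 2 for attr, value in rule.items())


def covered_examples(rule, examples, t):
    """Examples whose SumDifference does not exceed the threshold t."""
    return [e for e in examples if sum_difference(rule, e) <= t]


def attribute_bounds(rule, examples, t, continuous):
    """(MinC, MaxC) of each continuous rule attribute over the covered examples."""
    covered = covered_examples(rule, examples, t)
    if not covered:
        return {}
    return {attr: (min(e[attr] for e in covered), max(e[attr] for e in covered))
            for attr in rule if attr in continuous}
```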

BRAINNE reclassifies the examples using the final form of the rules with the 
ranges as determined directly from the data. Several positive features of this 
approach should be mentioned. The calculation of ranges directly from the data 
avoids user selection of the ranges. The user may have no idea of appropriate 
ranges. Also, as the ranges are determined separately for each rule, classification 
precision is enhanced, and the question of coverage of an example by a rule is 
simply answered. 

3 IMPROVING THE EXTRACTION OF DISJUNCTIVE RULES 
USING A HYBRID NEURAL NETWORK 

Let us consider the case of an output class $O_1$ that consists of two disjunctive 
subclasses, $O_{11}$ and $O_{12}$, each of which is defined by a separate rule, R1 
and R2 respectively. These rules have the form: 

R1: IF $A_1$ THEN $O_{11}$, 

R2: IF $A_2$ THEN $O_{12}$. 

Class $O_1$ would then be defined by the disjunction of the antecedents to give: 

IF $A_1 \vee A_2$ THEN $O_1$. 

$A_1$ and $A_2$ could be composite premises involving the conjunction of several 
attributes $(a_{11}, \dots, a_{1n})$ and $(a_{21}, \dots, a_{2m})$ respectively. 

All the attributes $(a_{11}, \dots, a_{1n}, a_{21}, \dots, a_{2m})$ that are part of 
the disjunctive rule would be forced to have a strong link to the output class $O_1$ 
during training by a supervised network. It is difficult to distinguish which of the 
attributes form a conjunctive group (say the $a_{1i}$) and which groups are 
disjunctive. This led to the need for the technique described in the previous 
section involving multiple resolving of a neural network to separate out groups of 
attributes that are conjunctive with each other and disjunctive with other groups. 

An unsupervised network essentially identifies combinations of variables that 
occur together frequently and allocates them to a class. In the case of the above 
disjunctive rule, it would group the features $(a_{11}, \dots, a_{1n})$ to a cluster 
which we could identify as the subclass $O_{11}$, and the features 
$(a_{21}, \dots, a_{2m})$ to a cluster which we could identify as the subclass 
$O_{12}$. These clusters corresponding to $O_{11}$ and $O_{12}$ would not be 
identified by the unsupervised approach as belonging to the same output class 
$O_1$, as there is no information about the target output classes. 

If this information were subsequently used with a supervised network, we could 
distinguish that the clusters $O_{11}$ and $O_{12}$ are disjunctive subclasses of the 
output class $O_1$. It was felt, therefore, that a hybrid approach which combines 
supervised and unsupervised learning would be essential to identify disjunctive 
rules directly. 
Figure 1 Hybrid neural network. 

A hybrid network consisting of two separate layers is required (Figure 1). The 
Kohonen model (Kohonen, 1990) is used as the unsupervised network in layer 1 
due to its simple structure and its ability to extract inherent statistical features. The 
output from layer 1 is used as the input for layer 2. Layer 2 consists of a supervised 
network using a single layer perceptron model (McClelland & Rumelhart, 1988). 
Once layer 1 is trained on the information, it is analysed using the Unsupervised 
BRAINNE algorithm explained in Section 3.1. This analysis process generates a 
set of production rules with consequents $(C_1, C_2, \dots, C_n)$ corresponding to 
clusters formed in the Kohonen output layer. 

The layer 2 supervised network is then stimulated with the outputs from the 
Kohonen layer. Once trained, it is used to classify the clusters into different output 
classes. The example data is really data suitable for supervised training in that, for 
each example, the values of the inputs as well as the associated target output class 
are given. Hence the number of output classes is known. Further, we know that a 
given example corresponds to a given class. Thus, this process replaces the 
symbolically denoted consequents of the production rules generated by the first 
unsupervised layer with known output classes. Importantly, this automatically 
picks up the disjunctive classes. 

The algorithm therefore is as follows: 

1. Apply Supervised BRAINNE (Sestito & Dillon, 1994). If only unique 
conjunctive rules are indicated, generate these rules and stop. If not, go to 
step 2. 

2. Train the unsupervised layer 1 on the data and use Unsupervised BRAINNE 
to generate production rules with symbolically labelled clusters as the 
consequents of these rules. 

3. Run the data through layer 1 and train the supervised network. Replace the 
virtual labels assigned in step 2 with known output classes. 

The method of rule extraction using Unsupervised BRAINNE employed in step 2 
is explained in Section 3.1. The method used in step 3 is explained in Section 3.2. 
3.1 The unsupervised layer 

Unsupervised BRAINNE is a method for extracting production rules that define 
the clusters in the output layer of a trained Kohonen network. The following is an 
outline of the procedure (Sestito & Dillon, 1994): 

1. Preprocess data to transform it into the correct format. 

2. Assign a given dimension to the Kohonen layer. 

3. Determine the weights of the Kohonen network. 

4. Delete the irrelevant attributes that are identified as having zero weight vector 
components to all the output nodes. 

5. Use either the threshold technique or breakpoint technique to determine the 
contributory and inhibitory inputs for the initial rules. 

6. Using the antecedents of the rules, define clusters of output nodes, each 
cluster associated with a given rule (see details below). 

7. If the number of nodes not belonging to a cluster is large, assign a new 
dimension to the Kohonen layer and go to step 3. 

8. Assign virtual labels to the clusters. 

A description of steps 1 to 3 can be found in Kohonen (1990). 

In Hybrid BRAINNE the threshold technique is employed in step 5. This 
technique uses the premise that inputs with the largest weights to an output node 
contribute most to that node. However, due to problems such as noise inherent in 
real-world domains, inputs with weights within a certain limit from the maximum 
weight are considered. 

All inputs i are included as contributory in the antecedent of the rule for output 
node j in the Kohonen layer if: 

$|w_{\max,j} - w_{ij}| < T$ 

where $w_{ij}$ is the weight from input i to output node j, $w_{\max,j}$ is the 
largest weight to node j, and the threshold T is a number between 0 and 1. Note 
that if $w_{\max,j}$ is not sufficiently large, no input is considered contributory. 
Inhibitory inputs are determined by the same means, but they lie within a separate 
threshold from zero. These inputs are negated in the antecedent. Inputs that are 
not selected as contributory or inhibitory are ignored. 
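
A minimal sketch of the threshold technique for one Kohonen output node follows. The parameter names are hypothetical; in particular `min_peak`, the limit below which the maximum weight is judged too small, is an assumption, since the text does not say how "sufficiently large" is decided.

```python
def initial_rule(weights, t_contrib, t_inhib, min_peak):
    """Contributory and inhibitory inputs for one Kohonen output node.

    weights: dict input name -> weight to the node (normalised to [0, 1]).
    t_contrib: distance from the maximum weight within which inputs contribute.
    t_inhib: distance from zero within which inputs are inhibitory.
    min_peak: hypothetical lower limit on the maximum weight.
    """
    peak = max(weights.values())
    contributory = set()
    if peak >= min_peak:
        contributory = {i for i, w in weights.items()
                        if abs(peak - w) < t_contrib}
    inhibitory = {i for i, w in weights.items()
                  if abs(w) < t_inhib and i not in contributory}
    return contributory, inhibitory
```

Nodes that end up with the same pair of sets share an antecedent and would therefore fall into the same cluster.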

Clusters are defined as groups of nodes in the Kohonen layer which have the 
same antecedent. Each cluster is assigned a virtual label: $C_1, C_2, \dots, C_n$. 
The rules derived from this stage are then of the form 

$x_1 \wedge x_2 \wedge \dots \wedge x_n \Rightarrow C_l$, 

where $x_k$ is the symbolic equivalent of input k and $C_l$ represents cluster l. 
Nodes which do not belong to any clusters could represent: 
• intermediate concepts or higher level concepts in a concept hierarchy that is 
yet to be identified, or 

• a mixture of instances of topologically adjacent clusters which does not 
represent a separate concept. 

The first case is identified by the existence of one or more inputs which are neither 
contributory nor inhibitory. Corresponding nodes are designated a new cluster. If the 
number of nodes which fall into the second case is unacceptably large, a new 
output layer with different dimensions is specified and the process is repeated until 
the number of unidentified nodes reduces to an acceptable level. 

3.2 The supervised layer 

During training in this second layer, we note that the example data is suitable for 
supervised training in that the target output class for each example is given. Hence, 
we also know the number of output classes. The first layer constructs the initial 
conjuncts which form clusters; each node in the Kohonen layer has a conjunctive 
rule attached to it. A cluster is merely a group of nodes with the same antecedent. 

The second layer of the hybrid network is used to construct the disjuncts between 
the clusters. It consists of a single layer perceptron with every node in the Kohonen 
layer connected to every node in this layer. The activation $a_j$ of output node j is: 

$a_j = f\left(\sum_i a_i w_{ij} - T\right)$ 

where $a_i$ is the activation of input node i, $w_{ij}$ is the weight between nodes i 
and j, and T is the threshold. The activation function, f, is sigmoid. The delta rule 
is used for training; the weights are updated according to: 

$\Delta w_{ij} = \varepsilon \, \delta_j \, a_i$ 

where $\delta_j$, the error for output unit j, is given by $\delta_j = t_j - a_j$, the 
difference between target $t_j$ and the actual activation. The learning rate, 
$\varepsilon$, usually lies between 0 and 1. All weights in this layer are initialised 
to 0 before training. 

Training is performed by running the input through the rules produced by 
Unsupervised BRAINNE, firing the appropriate nodes in the Kohonen layer. These 
activations are then passed to the output nodes via weights. The error between the 
actual activation of the output and the desired activation is calculated and the 
weights in the second layer are adjusted accordingly. This process is repeated until 
the error is reduced to a predefined tolerance, when the training is stopped. 
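
The training loop described above can be sketched as follows. This is a minimal illustration and not the BRAINNE implementation: the function name, the default learning rate, the threshold of 0 and the stopping tolerance are all assumptions made for the example, and each training pattern is taken to be the 0/1 vector of cluster activations produced by the first layer.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_supervised_layer(patterns, targets, rate=0.5, threshold=0.0,
                           tolerance=0.05, max_epochs=10000):
    """Delta-rule training of a single-layer perceptron (illustrative sketch).

    patterns: list of 0/1 cluster-activation vectors from the first layer
    targets:  list of 0/1 target vectors, one entry per output class
    Returns the weight matrix w[i][j] between cluster i and output j.
    """
    n_in, n_out = len(patterns[0]), len(targets[0])
    # All weights start at 0, as stated in the text.
    w = [[0.0] * n_out for _ in range(n_in)]
    for _ in range(max_epochs):
        worst = 0.0
        for a, t in zip(patterns, targets):
            for j in range(n_out):
                net = sum(a[i] * w[i][j] for i in range(n_in)) - threshold
                delta = t[j] - sigmoid(net)      # error for output unit j
                worst = max(worst, abs(delta))
                for i in range(n_in):
                    # Inactive clusters (a[i] == 0) leave their weights unchanged.
                    w[i][j] += rate * delta * a[i]
        if worst <= tolerance:                   # assumed stopping tolerance
            break
    return w


# Hypothetical data: clusters 0 and 1 belong to output 0, cluster 2 to output 1.
weights = train_supervised_layer([[1, 0, 0], [0, 1, 0], [0, 0, 1]],
                                 [[1, 0], [1, 0], [0, 1]])
```

With this toy data the weights from clusters 0 and 1 to output 0 end up positive, while their weights to output 1 end up negative, which is the incrementing and decrementing behaviour described in the next paragraph.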

When a training example is presented to the input of the first layer of the hybrid 
network, only those clusters in the first layer whose rules match the input are 
activated (output of 1). All other nodes are deactivated (output of 0). The weights 
between active clusters and those outputs in the supervised layer whose desired 
activation is 1 will be incremented, while the weights between the same clusters 
and the remaining outputs will be decremented. If the clusters are active when a 
particular output node is supposed to be active, training will ensure the sum of all
the weights from the active clusters to the output node is greater than its threshold:

$$\sum_i a_i w_{ij} > T.$$

This means that, once training is complete, any cluster with positive weights to an 
output will independently define that output, ensuring that it can form part of a 
disjunct if other such clusters exist for the same output. Hence, clusters with 
positive weights to output j can be combined in disjunction: 

$$C_1 \lor C_2 \lor \dots \lor C_n \Rightarrow X_j$$

where $X_j$ is the symbolic equivalent to output node $j$. This is equivalent to
substituting $X_j$ in the conclusion of the rules defining $C_1$ to $C_n$. If $C_1$ is
in the disjunct for $X_j$, and

$$x_1 \land x_2 \land \dots \land x_n \Rightarrow C_1$$

is defined, then the following new rule is added to the production rule set:

$$x_1 \land x_2 \land \dots \land x_n \Rightarrow X_j.$$

A hierarchy of concepts is then produced by selecting rules with the least number 
of contributory inputs and replacing matching sections of antecedents in the rule 
set with the conclusion of the selected rules (Sestito and Dillon 1994). An example 
is shown in Table 1. The corresponding concept hierarchy is given in Figure 2.



Table 1 Forming a Concept Hierarchy

Original rule set        After forming hierarchy
(a ∧ b) ⇒ (X)            (a ∧ b) ⇒ (X)
(a ∧ b ∧ c) ⇒ (Y)        (X ∧ c) ⇒ (Y)
(a ∧ b ∧ d) ⇒ (Z)        (X ∧ d) ⇒ (Z)


     X
    / \
   Y   Z

Figure 2 Concept hierarchy.
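
The substitution step that produces Table 1 can be sketched as below. This is only an illustration under simplifying assumptions: rules are represented as (antecedent set, conclusion) pairs, negated inputs are ignored, and the function name `form_hierarchy` is hypothetical rather than taken from Sestito and Dillon (1994).

```python
def form_hierarchy(rules):
    """Rewrite antecedents using the conclusions of shorter rules.

    rules: list of (antecedent, conclusion) pairs, antecedent being a set
    of input names. Returns a new list in the same order.
    """
    rules = [(frozenset(a), c) for a, c in rules]
    # Select rules with the fewest contributory inputs first.
    order = sorted(range(len(rules)), key=lambda k: len(rules[k][0]))
    for k in order:
        ante_k, concl_k = rules[k]
        for m, (ante_m, concl_m) in enumerate(rules):
            # A strict superset antecedent contains a matching section,
            # which is replaced by the conclusion of the selected rule.
            if m != k and ante_k < ante_m:
                rules[m] = ((ante_m - ante_k) | {concl_k}, concl_m)
    return rules


table1 = [({"a", "b"}, "X"), ({"a", "b", "c"}, "Y"), ({"a", "b", "d"}, "Z")]
hierarchy = form_hierarchy(table1)
```

Applied to the original rule set of Table 1, the sketch leaves the rule for X unchanged and rewrites the other two antecedents to {X, c} and {X, d}, matching the right-hand column of the table.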
The rule set may then be reduced by selecting the highest level rules in the concept
hierarchy that define only one output.

If there exist outputs that do not have at least one unique rule, it is possible that
the unsupervised layer is not large enough to capture the information required for
defining these outputs. In this case the dimensions of the unsupervised layer may
be increased and training repeated.

3.3 Results for extraction of disjunctive rules



LED data 

The LED data problem involves seven inputs representing light emitting diodes on 
a digital display (Figure 3). The training set consists of ten unique examples 
representing the ten decimal numbers. 



[Figure: seven-segment display with its segments labelled Input0 to Input6]

Figure 3 Seven inputs for the LED data problem.

The solution to this problem does not require disjunctive rules, so in order to test 
the ability of Hybrid BRAINNE to derive disjunctive rules, the training set was 
modified so that only five output classes exist as shown in Table 2. This means that 
there could exist six separate rules for output 4. 



Table 2 Modified LED Training Set

Actual digit output      New output
0                        0
1                        1
4                        2
7                        3
2, 3, 5, 6, 8, 9         4
[Figure: grid of Kohonen layer nodes, each labelled with the cluster it belongs to]

Figure 4 Kohonen layer clusters.

After training the Kohonen layer on the ten training examples, the clusters shown
in Figure 4 were identified. Once the second layer was trained, relevant outputs
were added to the conclusion of each rule. The rule set at this stage is listed in
Table 3.

A hierarchy of rules was then constructed. The five separate structures are shown 
in Figure 5. The upper diagram shows where the clusters reside within the 
hierarchy, while the lower diagram is a representation of the inputs that are on, off 
or not specified. 



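[Figure: five tree structures built from the clusters; the node labels that survive
in the source are C8, C5, C13, C16, C2, C7, C12, C11, C4, C9, C10, C14, C1, C6,
C0 and C3]

Figure 5 Clusters forming the five separate structures.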

Taking the highest level clusters from the hierarchy which do not have clashing
outputs results in the rule set given in Table 4. Only two disjunctive rules were
required for output 4.








Table 4 Final Rule Set

Trimmed rule set
(Input0 ∧ Input1 ∧ Input2 ∧ ¬Input3 ∧ Input4 ∧ Input5 ∧ Input6) ⇒ (Output0)
(¬Input0 ∧ ¬Input1 ∧ Input2 ∧ ¬Input3 ∧ ¬Input4 ∧ Input5 ∧ ¬Input6) ⇒ (Output1)
(¬Input0 ∧ Input1 ∧ Input2 ∧ Input3 ∧ ¬Input4 ∧ Input5 ∧ ¬Input6) ⇒ (Output2)
(Input0 ∧ ¬Input1 ∧ Input2 ∧ ¬Input3 ∧ ¬Input4 ∧ Input5 ∧ ¬Input6) ⇒ (Output3)
(Input0 ∧ Input1 ∧ Input3 ∧ Input5 ∧ Input6) ⇒ (Output4)
(Input0 ∧ Input2 ∧ Input3 ∧ Input6) ⇒ (Output4)



4 IMPROVING THE GENERALISATION CAPABILITY 

4.1 Improvement of generalisation capability by avoiding overtraining 

When a network is overtrained its generalisation capabilities decrease. A properly 
trained network can respond confidently with data it has not seen previously. An 
overtrained network, however, tends not to classify unseen data properly, while it 
fits the training data very well. An overtrained network is said to have memorised 
the training pattern. Thus an overtrained network guarantees that when the network 
is provided with a pattern from the training set, it will respond exactly in the same 
manner that it was trained (El-Sharkawi, 1996). 

A network is said to have good generalisation capability when it can respond 
well with data it has not seen before. There are several different ways to ensure that 
a network is trained, and has not just memorised. All these techniques are based on 
the fact that, for given training and test data, a network should respond with a 
similar error for the training and test data sets (El-Sharkawi, 1996). 

For BRAINNE, the training set is divided into two subsets (one larger than the 
other). The larger subset is used to train the network and the smaller subset is used 
to monitor the error, and this is used as the basis of the stopping criterion. 

After each iteration of the network, the test data is used to calculate the error 
measures of the network. We then try to determine the global minimum for the test 
data error and stop training the network at that point. The training error usually 
decreases as the number of epochs of training increases. However, the test error
goes through a minimum and then increases.



Table 3 Initial Rule Set

Initial rules
(Input0 ∧ Input1 ∧ Input2 ∧ ¬Input3 ∧ Input4 ∧ Input5 ∧ Input6) ⇒ (C0 ∧ Output0)
(Input0 ∧ Input1 ∧ ¬Input2 ∧ Input3 ∧ Input4 ∧ Input5 ∧ Input6) ⇒ (C1 ∧ Output4)
(¬Input0 ∧ Input1 ∧ Input2 ∧ Input3 ∧ ¬Input4 ∧ Input5 ∧ ¬Input6) ⇒ (C2 ∧ Output2)
(¬Input0 ∧ ¬Input1 ∧ Input2 ∧ ¬Input3 ∧ ¬Input4 ∧ Input5 ∧ ¬Input6) ⇒ (C3 ∧ Output1)
(Input0 ∧ Input1 ∧ ¬Input2 ∧ Input3 ∧ Input5 ∧ Input6) ⇒ (C4 ∧ Output4)
(Input1 ∧ Input3 ∧ Input5) ⇒ (C5 ∧ Output2 ∧ Output4)
(Input0 ∧ Input1 ∧ ¬Input2 ∧ Input3 ∧ ¬Input4 ∧ Input5 ∧ Input6) ⇒ (C6 ∧ Output4)
(Input0 ∧ Input1 ∧ Input3 ∧ Input5 ∧ Input6) ⇒ (C7 ∧ Output4)
(Input2 ∧ Input3) ⇒ (C8 ∧ Output2 ∧ Output4)
(Input0 ∧ Input1 ∧ Input3 ∧ ¬Input4 ∧ Input5 ∧ Input6) ⇒ (C9 ∧ Output4)
(Input0 ∧ Input1 ∧ Input2 ∧ Input3 ∧ ¬Input4 ∧ Input5 ∧ Input6) ⇒ (C10 ∧ Output4)
(Input0 ∧ ¬Input1 ∧ Input2 ∧ Input3 ∧ Input4 ∧ ¬Input5 ∧ Input6) ⇒ (C11 ∧ Output4)
(Input0 ∧ ¬Input1 ∧ Input2 ∧ ¬Input3 ∧ ¬Input4 ∧ Input5 ∧ ¬Input6) ⇒ (C12 ∧ Output3)
(Input0 ∧ ¬Input1 ∧ Input2 ∧ ¬Input4 ∧ Input5) ⇒ (C13 ∧ Output3)
(Input0 ∧ ¬Input1 ∧ Input2 ∧ Input3 ∧ ¬Input4 ∧ Input5 ∧ Input6) ⇒ (C14 ∧ Output4)
(Input0 ∧ Input2 ∧ Input3 ∧ ¬Input4 ∧ Input5 ∧ Input6) ⇒ (C15 ∧ Output4)
(Input0 ∧ Input2 ∧ Input3 ∧ Input6) ⇒ (C16 ∧ Output4)


Table 5 Example Test Error Values

Epoch    Test Error
11       4.32
12       4.20
13       4.50
14       4.30
15       4.10
16       4.80
17       5.30
18       5.58



For BRAINNE, we compared the test errors from one iteration to the next. If the 
error decreases and then starts to increase consistently the training is stopped at the 
point where the error starts to increase. For example, for the error values illustrated 
in Table 5 and Figure 6, the training is stopped after epoch 15 since the error value 
increases consistently after that point. 
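
One way to make "increases consistently" concrete is to require a fixed number of consecutive rises in the test error. The sketch below uses the values of Table 5; the `patience` parameter and the function name are assumptions introduced for illustration, since the text does not state how many consecutive increases BRAINNE requires.

```python
def stopping_epoch(epochs, test_errors, patience=2):
    """Return the epoch after which the test error rises `patience` times in a row.

    If no such run of increases is found, the last epoch is returned.
    """
    rises = 0
    for i in range(1, len(test_errors)):
        if test_errors[i] > test_errors[i - 1]:
            rises += 1
            if rises >= patience:
                # The run of increases began right after this epoch.
                return epochs[i - patience]
        else:
            rises = 0
    return epochs[-1]


# Values from Table 5.
epochs = [11, 12, 13, 14, 15, 16, 17, 18]
test_errors = [4.32, 4.20, 4.50, 4.30, 4.10, 4.80, 5.30, 5.58]
stop_at = stopping_epoch(epochs, test_errors)  # epoch 15 with patience=2
```

With `patience=1` the same data would stop at epoch 12, because the single rise at epoch 13 would already count; requiring two consecutive rises skips that temporary bump and selects epoch 15, as in the example above.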

Figure 6 Example test and training errors.

By avoiding overtraining, the generalisation capability of BRAINNE should 
increase, thus allowing better performance with previously unseen data. 
4.2 Improvement of bounds

For continuous data, BRAINNE produces conjunctive rules where the attributes 
usually have an upper bound and a lower bound. For instance, the rule: 

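IF ($x_1 < a < x_2$) AND ($y_1 < b < y_2$) THEN C

can be represented as an interval from $x_1$ to $x_2$ on the axis of attribute A,
combined (AND) with an interval from $y_1$ to $y_2$ on the axis of attribute B,
leading to THEN C,

where $\{x_1, x_2 \in X : X \in \Re\}$ and $\{y_1, y_2 \in Y : Y \in \Re\}$.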

These bounds are created during training using the training data. They reflect the 
upper and lower values within the training set for a contributory input. The 
problem with this approach is that the bounds may be too rigid for a new data set. 
We could have a situation where the unseen data range may fall outside the bound 
generated by BRAINNE. The following can happen:

• Lower range of the new data set may fall outside the generated bound.

• Upper range of the new data set may fall outside the generated bound.

• Both upper and lower ranges may be outside the generated bound.



The most obvious way to overcome this problem would be to increase the bounds 
towards the maximum ([0,1] for normalised data) for a contributory input. This,
however, can present us with new problems of misclassification. BRAINNE is
tested for two attributes: 



1. coverage of data (i.e. % of examples covered by the rules generated by 
BRAINNE); 

2. correctness of classification. 
To increase the generalisation capability we need to increase the coverage of
unseen data. However, we also have to ensure that we do not increase the 
misclassification rate in the process. It is possible to have 100% coverage of the 
unseen data, but it is very likely that the network will have a much higher 
occurrence of misclassification. 

Using the approach given in Sestito and Dillon (1994) an initial set of bounds for 
the contributory inputs is generated. We then increase these bounds progressively, 
checking for misclassification after each increment. If an increase leads to a 
misclassification, the bound is returned to its previous value. The minimum
bound and the maximum bound for each attribute were increased separately. 

The steps for improvement in bounds are as follows: 

1. Increase maximum/minimum bound for an attribute by $\theta$ (usually around 10%).

2. Check for misclassification.

3. If a classification error is produced, decrease the bound by $\theta$; else repeat
step 1.

4. Pick next bound and go to step 1. If all bounds for all attributes have already
been processed, stop.

This method is used for each contributory attribute of a rule. For normalised data, 
the largest value for the upper bound is 1 and the smallest value for the lower 
bound is 0. 
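
The four steps above can be sketched as follows for a single rule on normalised data. Several details are assumptions made for the example and are not taken from the source: the step is applied as an absolute amount of 0.1 on the [0, 1] scale, bounds are treated as inclusive, a rule is represented as a dictionary of attribute bounds plus a class label, and misclassification is checked against a list of labelled examples.

```python
def covers(bounds, x):
    """True if example x satisfies every bound of the rule (inclusive bounds)."""
    return all(lo <= x[attr] <= hi for attr, (lo, hi) in bounds.items())


def misclassifies(bounds, label, examples):
    """True if the rule covers any example whose class differs from `label`."""
    return any(covers(bounds, x) and y != label for x, y in examples)


def widen_bounds(bounds, label, examples, step=0.1):
    """Widen each lower and upper bound separately until a misclassification occurs."""
    bounds = {attr: list(b) for attr, b in bounds.items()}
    for attr in bounds:
        # side 0 is the lower bound (moves towards 0), side 1 the upper (towards 1).
        for side, limit, sign in ((0, 0.0, -1), (1, 1.0, +1)):
            while bounds[attr][side] != limit:
                previous = bounds[attr][side]
                moved = previous + sign * step                    # step 1
                bounds[attr][side] = max(0.0, min(1.0, moved))
                if misclassifies(bounds, label, examples):        # step 2
                    bounds[attr][side] = previous                 # step 3
                    break
    return bounds


# Hypothetical rule for class "A" and two labelled examples.
rule = {"x": (0.4, 0.6)}
data = [({"x": 0.3}, "A"), ({"x": 0.85}, "B")]
widened = widen_bounds(rule, "A", data)
```

In this toy case the lower bound reaches 0 without harm, while the upper bound stops just below 0.8, because one more step would cover the class "B" example at 0.85.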

4.3 Results for the generalisation capability 

The methods discussed in Sections 4.1 and 4.2 were tested using the IRIS domain 
from the Irvine machine learning databases. 

The rules generated by the system were tested for coverage and classification 
capability using the training data, overtraining test data and unseen test data. The 
data contains 3 different types of plants described by 4 input attributes: 



Table 6 Iris Domain

Input Attributes     Outputs
sepal-length         Iris Setosa
sepal-width          Iris Virginica
petal-length         Iris Versicolor
petal-width
The input attributes are normalised using the formula:

$$\text{Normalised Value} = \frac{\text{Actual Value} - \text{Min}}{\text{Max} - \text{Min}}$$



This domain consists of 150 examples. Of these examples, 90 were used to train 
the network, 30 to test for overtraining and the remaining 30 (unseen) examples 
were used to test the performance of the rules generated by the system. The 
network was trained with 6 hidden units, learning rate of 0.2 and momentum value 
of 0.7. The results are summarised in Tables 7 to 9. 



Table 7 BRAINNE

Data Set         Classification Incorrect    Covered/Not Covered
Training Data    3                           76/14
Test Data        0                           26/4
Unseen Data      1                           21/9


Table 8 BRAINNE after Overtraining Check

Data Set         Classification Incorrect    Covered/Not Covered
Training Data    3                           77/13
Test Data        0                           26/4
Unseen Data      1                           23/7



Table 9 BRAINNE after Increase in Bounds

Data Set         Classification Incorrect    Covered/Not Covered
Training Data    3                           87/3
Test Data        0                           28/2
Unseen Data      1                           29/1
A slight improvement in the coverage capability after using the overtraining test
can be seen in Table 8. From Table 9, there is a significant improvement in the 
generalisation capability of BRAINNE without any reduction in classification 
capabilities. 

5 CONCLUSION 

This paper presented three different sets of improvements to the BRAINNE 
method. The first of these uses a two layer hybrid network, with unsupervised 
learning at the first layer and supervised learning at the second, to extract 
disjunctive rules directly. This avoids the need in the previous approach for 
multiple resolving of modified networks corresponding to different sets produced 
by a generate-and-test procedure. 

The second approach stops training earlier by monitoring the error measures 
using a separate test set of examples. This improves the generalisation capability of 
the neural network and of the derived rules. 

The third improvement leads to progressive extension of the bounds used in 
defining rules for the case of continuous features. All of these approaches were 
tested on different sets of examples and improvements were obtained. 
Abstract

The aim of this research is to establish a coherent framework for data mining. 
Observing that data mining depends on two partitions, the classifier and the 
estimator, this paper defines the classifier/estimator (CE) framework. The
classifier indicates the target of the data mining investigation. The classifier 
may be difficult to express from the data instance or may involve an “oracle” 
beyond the extant data. The estimator is typically simply expressible using 
the data instance. The degree to which the estimator refines the classifier 
partition can be used to measure how well the data instance matches the 
concept being investigated. 

The CE framework is shown to generalize a variety of data mining and 
database concepts, including rough sets, functional dependency, multivalued 
dependency, and association rules. Furthermore, the CE framework suggests a 
wider range of data mining questions. The CE framework is shown to naturally 
express qualitative and quantitative measures of the quality of approximation. 
Additionally, the CE framework allows a question to be posed at a number of 
different conceptual scopes from local to global interests. 

Keywords 

Data mining, framework, partition, relational model, rough sets, association 
rules, functional dependencies, multivalued dependencies, error 



1 INTRODUCTION 

Data mining faces a predicament similar to that databases faced prior to 
Codd’s (Codd 1970) introduction of the relational model: different data min- 
ing problems seem to have little to do with one another, approaches are gener- 
ally ad-hoc, there is no concise or precise means of specifying problems, and so 
forth. This situation has been observed elsewhere, for example (Imielinski and
H.Mannila 1996, Mannila 1997). As with databases when all access was
navigational, data mining semantics is often defined by implementation, in this
case search. There is no distinction between what is being sought and how
the search is being carried out. While some of this admittedly is the result 
of data mining’s diversity, there do exist enough common features to provide 
uniformity, if not for all of data mining, at least for a well-defined portion. 
The aim of this research is to establish a coherent framework for data mining; 
in this paper we focus on the relational model (Codd 1970, Abiteboul, Hull 
and Vianu 1995a, Ullman 1988a). Our exemplar is the success of the rela- 
tional model in addressing two particular issues in databases: providing data 
independence and a rigorous mathematical model. We address comparable 
problems in data mining with the hope of achieving like-minded results. 

Data mining is the elicitation of useful information from large amounts 
of data (Fayyad and G.Piatetsky-Shapiro 1996a). The kind of information 
generally falls into one of four broad categories: facts, models, deviations, and 
trends (Simoudis 1996). Facts are directly supported by the data and include 
information such as “if symptom(alpha-fetoprotein), then cancer(liver).”
Modelling characterizes the data and enables the user to seek information 
entailed by the data, such as predicting the class to which a new data item 
belongs. Deviations look for data that does not belong. Trends look at data 
in the context of a temporal dimension, time being promoted to a basic type. 
Our framework supports the first three categories. An investigation is made 
with respect to some concept that varies along two dimensions. First, there 
is a degree of generalization/specialization, in that the mining may look for 
general aspects of all the data or specific details of a select subset. Second,
there is a degree of approximate validity, in that the mined information lies
somewhere between true and false.

To be somewhat more concrete, suppose our concept is $X\,\theta\,Y$, where $\theta$
indicates some relationship between data components $X$ and $Y$. Irrespective
of how we interpret $X$, $Y$, $\theta$, we can envision a degree of
detail/generalization. For example, at the most general we are interested in all
$X, Y$ that might hold for $\theta$. In this case, $X, Y$ are free and are found in
some $X$-space and $Y$-space, respectively. But we are also interested in less
global information, e.g., all $Y$s that hold for a distinguished $X$. In this case,
$Y$ is free and $X$ is fixed. The second dimension is degree of quality of
approximation. Since
“true” information is hard to come by, we settle for its approximation, the 
usefulness of which is decided by both the goodness of the approximation 
and our need. While specification of a good approximation depends upon the 
application, there is only one specification of absolutely true information, what 
we call perfect information, against which all approximations will ultimately 
be judged. 

The next section gives a glimpse of our framework via a standard database
concept, functional dependency. Section 3 examines a popular data mining
framework, rough sets, and begins to formally develop our framework; section
4 completes this formalization. Section 5 shows how some typical database and
data mining concepts are characterized in terms of our framework. Section 6
develops a suite of quantitative and qualitative metrics. The remaining section
gives a summary and points toward future work.

*A notion that, while too broad to define formally, is understood to establish a context of
what one is interested in.

* Variations of this are commonly referred to as “roll-up” and “drill-down” in the data cube
approach (Harinarayan, J.D. Ullman and Rajaraman 1996).



2 INTRODUCTION TO THE FRAMEWORK 

We begin by choosing a concept for which to mine; we can choose any one 
of a number of concepts, but decide upon functional dependency (FD) (Codd
1972, Vardi 1987, Abiteboul et al. 1995a, Ullman 1988a). FDs
are well-known, have a formal definition, support a well-studied theory, and 
have proved invaluable in their practical application in many different areas. 
We first present an informal description of FDs. We next explain how our 
framework perfectly characterizes this concept. We then discuss how the CE 
framework can vary the two dimensions (generalization/specialization and 
truth approximation) of FDs. 

We now present a formal definition of functional dependency in the context
of a specific example; this formalism is, however, general in that replacing
the specific attribute names with more abstract “X”, “Y”, or “Z” gives a
suitably general definition of FD. The example continues the medical theme:
the relation ADMIT has four attributes: pnt, phy, date, and hos. The meaning
of the tuple $t \in$ ADMIT is that patient t.pnt was admitted to hospital t.hos
on t.date with physician t.phy attending. A medical care manager might be
interested in whether every physician practices (as attending physician) at
only one hospital. If this were true, it would mean that the FD phy $\to$ hos
holds. Using notation such as ADMIT.phy to indicate the set of all physicians
in the relation ADMIT, the above FD formalizes as

$$(\forall p \in \mathrm{ADMIT.phy})(\exists h \in \mathrm{ADMIT.hos})(\forall t \in \mathrm{ADMIT})[t.\mathrm{phy} = p \Rightarrow t.\mathrm{hos} = h]. \qquad (1)$$

We now describe an informal (the formalism follows later) characterization
of phy $\to$ hos in the CE framework. Given an instance ADMIT for which we
wish to test phy $\to$ hos, we first define two classes of parameterized sets:
$E_p = \{t \in \mathrm{ADMIT} \mid t.\mathrm{phy} = p\}$ and
$C_h = \{t \in \mathrm{ADMIT} \mid t.\mathrm{hos} = h\}$. It is obvious
that distinct $p$ values give distinct and disjoint $E_p$ and that each
$t \in$ ADMIT belongs to one $E_p$, namely the one for the $p$ such that
t.phy = p. Thus, $\{E_p\}_{p \in \mathrm{ADMIT.phy}}$ is a partition of ADMIT.
Similarly, $\{C_h\}_{h \in \mathrm{ADMIT.hos}}$ is a partition of ADMIT. The
stage is now set, since the CE framework requires two partitions, the classifier
and the estimator: in this case, $\{C_h\}_{h \in \mathrm{ADMIT.hos}}$ and
$\{E_p\}_{p \in \mathrm{ADMIT.phy}}$, respectively. Then phy $\to$ hos holds iff
every $E_p$ is a subset of one specific $C_h$.
Writing this symbolically, including an explicit formulation of subset, gives
Figure 1 One possible interpretation of “approximate” functional dependencies
illustrated in the CE framework. Each pie piece represents a single class
of the classifier, a partition on the attribute hospital, and corresponds to
one of three possible hospitals: Mercy, Memorial, and University. Each filled
ellipse represents a single class of the estimator, a partition on the attribute
phy, and corresponds to one of three physicians: Smith, Brown, and Jones.
One class of the estimator, the dashed ellipse containing Jones, lies across two
classes of the classifier and is therefore a straddler. Suppose that the classes
for Smith and Brown together contain a sufficiently large share of the tuples.
Then phy $\to$ hos holds approximately.



exactly the same form as our initial sentence, viz., $(\forall E_p)(\exists C_h)[E_p \subseteq C_h]$ or
equivalently, $(\forall E_p)(\exists C_h)(\forall t \in \mathrm{ADMIT})[t \in E_p \Rightarrow t \in C_h]$.
This alternative formulation adds nothing to standard functional dependency
theory (beyond demonstrating the adequacy of the CE framework to capture this
important concept), but the classifier/estimator pair is the central feature of
data mining to which we now return.
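
The partition-based test can be sketched in a few lines. The sample ADMIT tuples and the helper names `partition` and `fd_holds` are hypothetical; only the attribute names and the condition "every $E_p$ is a subset of one $C_h$" come from the text.

```python
from collections import defaultdict


def partition(rows, attr):
    """Group tuple indices by their value on `attr` (one class per value)."""
    classes = defaultdict(set)
    for i, row in enumerate(rows):
        classes[row[attr]].add(i)
    return dict(classes)


def fd_holds(rows, lhs, rhs):
    """lhs -> rhs holds iff every estimator class lies inside one classifier class."""
    estimator = partition(rows, lhs)    # the E_p classes
    classifier = partition(rows, rhs)   # the C_h classes
    return all(any(e <= c for c in classifier.values())
               for e in estimator.values())


# Hypothetical ADMIT instance.
admit = [
    {"pnt": "p1", "phy": "Smith", "date": "d1", "hos": "Mercy"},
    {"pnt": "p2", "phy": "Brown", "date": "d2", "hos": "Memorial"},
    {"pnt": "p3", "phy": "Jones", "date": "d3", "hos": "Memorial"},
    {"pnt": "p4", "phy": "Jones", "date": "d4", "hos": "University"},
]
holds_with_jones = fd_holds(admit, "phy", "hos")         # False
holds_without_last = fd_holds(admit[:3], "phy", "hos")   # True
```

In this instance the class for Jones spans two hospitals, so the dependency fails; dropping the last tuple restores it.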

Another dimension in which an FD can be investigated is to see how well it
fits the data (this is related to (J.Kivinen and Mannila 1995)). Since phy $\to$
hos is not the consequence of any business rule, there are a variety of questions
the medical care manager might pose: does it, in fact, always hold true (the
perfect case), does it hold for a particular physician and hospital, does it hold
with a few exceptions for each physician, does it hold for most physicians
but not for a small number of exceptions? The latter two cases are different
varieties of approximations. Approximations are awkward to express in an
FOL formulation of functional dependencies, but the many variations are all
natural using the CE framework. The common aspect of any approximation
of $X \to Y$ is that deviations are always associated with those estimator sets
that split over more than one classifier, that is, the $E_x$ that “straddle the
boundary” of some $C_y$. Such an $E_x$ we call a straddler.

One approximation of $X \to Y$ (written $X \rightsquigarrow Y$) might take as valid
$X \rightsquigarrow Y$ if 90% of the tuples of $r$ belong to an $E_x$ that does not
straddle any $C_y$; this is a global evaluation (see Figure 1). Another
approximation criterion is local, finding all $x$ such that 90% of $E_x$ belongs
to the corresponding $C_y$. If this percentage parameter is less than 50%, this
could yield an approximation to a functional dependency, e.g., for phy and hos,
which was no longer functional, in the sense that Jones $\rightsquigarrow$ Memorial
and Jones $\rightsquigarrow$ University.
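
Both criteria can be sketched on a small hypothetical instance. The function names `global_fit` and `local_fit`, the sample data and the use of "at least the threshold" as the comparison are assumptions made for this illustration.

```python
from collections import defaultdict


def partition(rows, attr):
    """Group tuple indices by their value on `attr`."""
    classes = defaultdict(set)
    for i, row in enumerate(rows):
        classes[row[attr]].add(i)
    return dict(classes)


def global_fit(rows, lhs, rhs):
    """Fraction of tuples lying in estimator classes that straddle no classifier class."""
    est, cls = partition(rows, lhs), partition(rows, rhs)
    inside = sum(len(e) for e in est.values()
                 if any(e <= c for c in cls.values()))
    return inside / len(rows)


def local_fit(rows, lhs, rhs, threshold):
    """Pairs (x, y) such that at least `threshold` of E_x lies inside C_y."""
    est, cls = partition(rows, lhs), partition(rows, rhs)
    return {(x, y) for x, e in est.items() for y, c in cls.items()
            if len(e & c) / len(e) >= threshold}


# Hypothetical instance: ten admissions, with Jones split over two hospitals.
admit = ([{"phy": "Smith", "hos": "Mercy"}] * 4
         + [{"phy": "Brown", "hos": "Memorial"}] * 4
         + [{"phy": "Jones", "hos": "Memorial"},
            {"phy": "Jones", "hos": "University"}])

share = global_fit(admit, "phy", "hos")          # 0.8
strict = local_fit(admit, "phy", "hos", 0.9)
loose = local_fit(admit, "phy", "hos", 0.4)
```

Here 80% of the tuples lie in non-straddling classes, so a 90% global criterion would reject the dependency. The local criterion at 90% accepts Smith and Brown only, while a threshold below 50% also accepts both Jones pairs, which is the non-functional outcome mentioned above.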

3 FROM ROUGH SETS TO CLASSIFIER/ESTIMATOR 

Rough sets have become a popular framework for data mining investiga- 
tions (Pawlak, Grzymala-Busse, Slowinski and Ziarko 1995, Slowinski and 
Stefanowski 1996, Lin 1997) but they must be generalized to capture many 
important database concepts that also pertain to data mining. In this section, 
we introduce rough sets, discuss why this framework fails to capture certain 
other concepts, introduce a broader framework, and show how this framework 
generalizes rough sets. 

Rough sets were introduced by Pawlak (Pawlak 1982, Pawlak 1991) as a
mathematical tool to reason about vagueness and uncertainty. Pawlak credits
Frege’s boundary-line view (that a property is vague if there exist objects
for which neither the property nor its complement completely hold) as his
motivation. He finds this boundary by differencing two sets: the upper-bound
that contains objects for which the property possibly holds and the lower-bound
for which the property certainly holds. The possibility and certainty are
established by operating over partitions, requiring that the property must hold
for at least one member of the entire class (possibility) or for the entire class
(certainty). A rough set application attempts to approximate a property $P$,
where a property is merely a subset of a finite universe $U$. This approximation
is achieved using an equivalence relation $\mathcal{E}$ over $U$, establishing two
bounds:

lower-bound $\wedge P_{\mathcal{E}} = \bigcup \{E \mid E \subseteq P, \text{ for } E \in \mathcal{E}\}$. (2)

upper-bound $\vee P_{\mathcal{E}} = \bigcup \{E \mid E \cap P \neq \emptyset, \text{ for } E \in \mathcal{E}\}$. (3)

Three ideas, crucial to data mining, were implicit in the rough sets
framework. They are:

1. Analyze a property $P$ by investigating its relationship(s) with certain
partitions of $U$.

2. Evaluate the interaction of the property $P$ and the partition $\mathcal{E}$ at
the level of individual classes of $\mathcal{E}$.

3. Measure the “goodness” of approximation of $P$ by $\mathcal{E}$ in terms of
$\vee P_{\mathcal{E}} - \wedge P_{\mathcal{E}}$, that subset of $U$ for which
membership in an $\mathcal{E}$-class does not determine membership or
non-membership in $P$.

As was discussed above, functional dependency is a concept from database
design theory which suggests a variety of data mining investigations.
Unfortunately, FDs are completely beyond the reach of the traditional rough set
characterization because a property only provides a binary partition of $U$
(namely, the partition $\{P, \bar{P}\}$).

With this in mind, we define the Classifier/Estimator (CE) framework in
terms of a classifier $\mathcal{C}$ and an estimator $\mathcal{E}$. Throughout the
remainder of this paper, $U$ is the universe, a finite set; complements are taken
with respect to $U$ unless explicitly specified. $\mathcal{C}$ and $\mathcal{E}$ are
partitions of $U$, called the classifier and estimator respectively. A partition of
course induces and is induced by an equivalence relation on $U$; for partition
$\mathcal{B}$, we write $x \mathcal{B} y$ to indicate that $x$ and $y$ belong to the
same $\mathcal{B}$-class. The use of $\mathcal{E}$ is entirely consistent with the
$\mathcal{E}$ of rough sets; $\mathcal{C}$ is a generalization of the partition
$\{P, \bar{P}\}$. We use $\mathcal{P}$ to denote the classifier $\{P, \bar{P}\}$.

Rather than using the concepts of upper-bound and lower-bound, which
fit a single property $P$ but not an arbitrary partition $\mathcal{C}$, the CE
framework generalizes the set difference of upper-bound and lower-bound. The
critical factor here is that this difference is a union of $\mathcal{E}$-classes that
straddle the boundary of some $\mathcal{C}$-classes. We formalize these notions:

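Definition 1 For $E \in \mathcal{E}$ and $C \in \mathcal{C}$, $E$ straddles $C$, written
$E \between C$, iff $E \cap C \neq \emptyset$ and $E - C \neq \emptyset$. Such an $E$,
irrespective of the $C$, is called a straddler. ■

Our boundary then, is made up of elements from straddlers. This boundary
is called the indeterminate set (for reasons we will give later) and is defined
as follows:

Definition 2 The indeterminate set, written $I_{\mathcal{C}}^{\mathcal{E}}$, is

$$I_{\mathcal{C}}^{\mathcal{E}} = \bigcup \{E \mid E \text{ is a straddler}\}.$$ ■

Proposition 1 $P - I_{\mathcal{P}}^{\mathcal{E}} = \wedge P_{\mathcal{E}}$.

Proof
$$\begin{aligned}
P - I_{\mathcal{P}}^{\mathcal{E}}
 &= \Big(\bigcup \{E \mid E \cap P \neq \emptyset\} \cap P\Big) - \bigcup \{E \mid E \between P\} \\
 &= \Big(\bigcup \{E \mid E \subseteq P\} \cup \big(\bigcup \{E \mid E \between P\} \cap P\big)\Big) - \bigcup \{E \mid E \between P\} \\
 &= \bigcup \{E \mid E \subseteq P\}.
\end{aligned}$$ ■

Proposition 2 $P \cup I_{\mathcal{P}}^{\mathcal{E}} = \vee P_{\mathcal{E}}$.

Proof
$$\begin{aligned}
P \cup I_{\mathcal{P}}^{\mathcal{E}}
 &= \bigcup \{E \cap P \mid E \cap P \neq \emptyset\} \cup \bigcup \{E \mid E \cap P \neq \emptyset \land E - P \neq \emptyset\} \\
 &= \bigcup \{E \mid E \cap P \neq \emptyset \land E - P = \emptyset\} \cup \bigcup \{E \mid E \cap P \neq \emptyset \land E - P \neq \emptyset\} \\
 &= \bigcup \{E \mid E \cap P \neq \emptyset\}.
\end{aligned}$$ ■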

Example 1 (Rough Sets) Let $U = \{t_1, t_2, t_3, t_4, t_5\}$ with an estimator
$\mathcal{E} = \{\{t_1, t_2, t_4\}, \{t_3\}, \{t_5\}\}$ and binary property
$\mathcal{P} = \{P, \bar{P}\}$, where $P = \{t_1, t_2, t_5\}$. Then
$I_{\mathcal{P}}^{\mathcal{E}} = \{t_1, t_2, t_4\}$ and the lower and upper
approximations of $P$ are
$\wedge P_{\mathcal{E}} = P - I_{\mathcal{P}}^{\mathcal{E}} = \{t_5\}$ and
$\vee P_{\mathcal{E}} = P \cup I_{\mathcal{P}}^{\mathcal{E}} = \{t_1, t_2, t_4, t_5\}$.

[Figure: $U$ is partitioned into halves, the shaded portion is $P$ and the unshaded
is $\bar{P}$. The classes of $\mathcal{E}$ are demarcated by the ellipses; the dashed
ellipse is a straddler.]
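
Example 1 can be checked mechanically. The sketch below is an illustration only: the function names are hypothetical, and tuples are represented by the strings "t1" to "t5".

```python
def straddles(e, c):
    """E straddles C iff E meets C without being contained in it."""
    return bool(e & c) and bool(e - c)


def indeterminate_set(estimator, classifier):
    """Union of all estimator classes that straddle some classifier class."""
    result = set()
    for e in estimator:
        if any(straddles(e, c) for c in classifier):
            result |= e
    return result


U = {"t1", "t2", "t3", "t4", "t5"}
estimator = [{"t1", "t2", "t4"}, {"t3"}, {"t5"}]
P = {"t1", "t2", "t5"}
classifier = [P, U - P]              # the binary partition {P, complement of P}

I = indeterminate_set(estimator, classifier)
lower = P - I                        # Proposition 1
upper = P | I                        # Proposition 2
```

The results agree with computing the bounds directly from definitions (2) and (3): the only class contained in $P$ is $\{t_5\}$, and the classes meeting $P$ are $\{t_1, t_2, t_4\}$ and $\{t_5\}$.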



Observe that, for any binary classification, there is a duality between the
lower and upper bounds: $\vee \bar{P}_{\mathcal{E}} = U - \wedge P_{\mathcal{E}}$. This
duality is an artifact of the size of $\mathcal{P}$, since when $|\mathcal{P}| = 2$,
there is only one way in which $E$ can be a straddler. When $|\mathcal{P}| \geq 3$,
however, the duality breaks down. The intuition is that lower-bounds remain
stable because they are completely contained and therefore, any n-ary
classification cannot affect this. The upper-bound of one class, however, could
straddle up to $n - 1$ classes. It is for this reason we choose “indeterminate” for
the boundary, since this set contains elements that are indeterminate with
respect to their membership in classes of the classifier, given the estimator.



4 INDETERMINATE SETS 

Now that we have shown how the CE framework handles rough sets, we take 
another look at the indeterminate set. 

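Suppose for some $\mathcal{E}$, $\mathcal{C}$, that $I_{\mathcal{C}}^{\mathcal{E}} = \emptyset$.
We know this means there are no straddlers, but there is another meaning
associated with this condition, partition refinement. Given two partitions
$\mathcal{A}$, $\mathcal{B}$ over $U$, $\mathcal{B}$ is a refinement of $\mathcal{A}$,
written $\mathcal{A} \preceq \mathcal{B}$, read “$\mathcal{A}$ is less refined than
$\mathcal{B}$,” iff, for every class $B \in \mathcal{B}$, there exists a class
$A \in \mathcal{A}$ such that $B \subseteq A$. We now have the following proposition:

Proposition 3 $I_{\mathcal{C}}^{\mathcal{E}} = \emptyset$ iff $\mathcal{C} \preceq \mathcal{E}$.

Proof For any $E \in \mathcal{E}$, $E$ intersects at least one $C \in \mathcal{C}$. But
since $I_{\mathcal{C}}^{\mathcal{E}} = \emptyset$, $E$ is not a straddler and hence,
$E - C = \emptyset$ and therefore, $E \subseteq C$. Now suppose
$\mathcal{C} \preceq \mathcal{E}$. Then there are no straddlers $E$, and therefore,
$I_{\mathcal{C}}^{\mathcal{E}} = \emptyset$. ■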
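
Proposition 3 can be illustrated on two small hypothetical partitions of $\{1, 2, 3, 4\}$. The function names below are assumptions introduced for this sketch; the refinement test follows the definition given above.

```python
def is_refinement(coarse, fine):
    """True iff every class of `fine` lies inside some class of `coarse`."""
    return all(any(b <= a for a in coarse) for b in fine)


def indeterminate_set(estimator, classifier):
    """Union of estimator classes that meet a classifier class without lying in it."""
    result = set()
    for e in estimator:
        if any((e & c) and (e - c) for c in classifier):
            result |= e
    return result


classifier = [{1, 2}, {3, 4}]
refining = [{1}, {2}, {3, 4}]        # every class sits inside a classifier class
crossing = [{1, 3}, {2}, {4}]        # {1, 3} lies across both classifier classes

empty_case = indeterminate_set(refining, classifier)
nonempty_case = indeterminate_set(crossing, classifier)
```

For the refining estimator the indeterminate set is empty and the refinement test succeeds; for the crossing estimator the indeterminate set is $\{1, 3\}$ and the refinement test fails, as the proposition predicts.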

Consider now when $I_{\mathcal{C}}^{\mathcal{E}} \neq \emptyset$. Although this means
that $\mathcal{C} \preceq \mathcal{E}$ does not hold, we can associate some measure of
“mismatch” with the set $I_{\mathcal{C}}^{\mathcal{E}}$. This in turn provides us a
means of approximating how much of a mismatch occurred. We will discuss this
more at length in Section 6, Metrics, but it suffices here to say that the
indeterminate set can provide both qualitative and quantitative measures in a
more general way than rough sets.



5 APPLICATIONS OF THE FRAMEWORK 

The goal of this section is to validate the CE framework by showing that it 
captures a wide variety of database and data mining themes. 

In each case we will exhibit a classifier and an estimator that perfectly char- 
acterize the concept at hand. But just as importantly, in each case it is also 
possible to transform the perfect version into a range of data mining investi- 
gations varying along the breadth/specificity and approximation dimensions 
(as in Section 2 above). Each of those variations is implicit in some special
treatment of the interaction of the $\mathcal{C}$ and $\mathcal{E}$ classes,
but we do not give any variation explicitly at this point. A subsequent
section, “Metrics”, discusses a toolkit for evaluating approximations.

To facilitate our discussion, let $R[A]$ denote a relational schema where
$A = \{A_1, \ldots, A_n\}$, and $r$ an instance over $R[A]$. For
$X \subseteq A$, we write $[X]$ to mean a partition of $r$ into sets of
tuples that agree on $X$. Formally, $s\,[X]\,t$ iff $s.X = t.X$, for
$s, t \in r$.



5.1 Functional Dependencies 

Dependency theory has long been used to facilitate relational database de- 
sign, but dependencies also arise quite naturally and often by themselves in 
databases. Much study has been devoted to this area, and there are numerous 
important and useful results. There is, for example, a small set of inference
rules (usually called Armstrong's Axioms (Armstrong 1974, Abiteboul, Hull
and Vianu 1995b)), both sound and complete, for FDs. With these axioms,
we can reason with and about FDs. Clearly, if we had a means of establishing
whether some number of FDs held (perfectly or approximately), we could use
these axioms to draw further, logically correct, conclusions about our data.
Here we show how the CE framework establishes when an FD exists, using an
equivalent formulation: $X \to Y$ iff
$(\forall s \in r)(\forall t \in r)[s.X = t.X \Rightarrow s.Y = t.Y]$.

Proposition 4 $X \to Y$ iff $I_{[Y]}^{[X]} = \emptyset$.

Proof Suppose $X \to Y$. We show that $E \in [X]$ does not straddle any
$C \in [Y]$. Since, for every $s, t \in r$,
$s.X = t.X \Rightarrow s.Y = t.Y$, it follows that $E \subseteq C$ for
some $C$ and $E \cap C' = \emptyset$ for any other $C'$, so $E$ is not a
straddler. Now, suppose $I_{[Y]}^{[X]} = \emptyset$. This means for every
$E \in [X]$, there exists some $C \in [Y]$ such that $E \subseteq C$ and
$E \cap C' = \emptyset$, for all other $C' \in [Y]$. In other words, for
$s, t \in r$, $s.X = t.X \Rightarrow s.Y = t.Y$. Hence, $X \to Y$. ■
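
As an illustration of Proposition 4, the sketch below tests a functional dependency by building the partitions [X] and [Y] and checking that no class of [X] straddles. It is a minimal sketch under our own assumptions: tuples are Python dicts, the sample relation and the helper names partition and fd_holds are hypothetical.

```python
from collections import defaultdict


def partition(rows, attrs):
    """Group row indices by their values on attrs: the partition [attrs]."""
    groups = defaultdict(set)
    for i, row in enumerate(rows):
        groups[tuple(row[a] for a in attrs)].add(i)
    return list(groups.values())


def fd_holds(rows, X, Y):
    """X -> Y holds iff every class of [X] lies inside one class of [Y]."""
    classifier = partition(rows, Y)
    return all(any(E <= C for C in classifier) for E in partition(rows, X))


rows = [
    {"emp": "ann", "dept": "db", "floor": 2},
    {"emp": "bob", "dept": "db", "floor": 2},
    {"emp": "cat", "dept": "ai", "floor": 3},
    {"emp": "dan", "dept": "ai", "floor": 4},
]
print(fd_holds(rows, ["emp"], ["dept"]))    # True
print(fd_holds(rows, ["dept"], ["floor"]))  # False: the "ai" class straddles
```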



5.2 Multivalued Dependencies

Although data mining has largely ignored multivalued dependencies (MVDs) 
(Delobel 1978, Fagin 1977, Ullman 1988b), they too are an interesting kind 
of dependency for a number of reasons: MVDs occur quite frequently, are the 
“flipside” of FDs (two sets of attributes are wholly independent rather than 
functionally dependent), and like FDs, there exists a sound and complete set 
of axioms for reasoning about MVDs (Beeri, R.Fagin and J.H.Howard 1977,
Ullman 1988b). Essentially an MVD requires that a set of Y values be
associated with a particular X value. In fact, when the relational formalism
is extended to allow sets as well as atomic values, an MVD is simply an FD
with a set-valued attribute on the right-hand side. Before giving the CE
characterization, we formally define an MVD.

In the following, say $A = X \cup Y \cup Z$, where $X$, $Y$, $Z$ are pairwise
disjoint. We write $t \in r$ as $\langle x, y, z \rangle$, where $t.X = x$,
$t.Y = y$, and $t.Z = z$. Also, for tuple $s$,
$N_Y(s) = \{y \mid \langle s.X, y, s.Z \rangle \in r\}$.

Definition 3 $X$ multidetermines $Y$ in the context $Z$ (mention of $Z$ is
often omitted since it is implied by $Z = A - (X \cup Y)$), written
$X \twoheadrightarrow Y \mid Z$, if whenever $s, t \in r$ and $s.X = t.X$,
then $N_Y(s) = N_Y(t)$. ■

With this formulation, and the $\mathcal{C}$, $\mathcal{E}$ pair given by
$\mathcal{E} = [X]$ and $\mathcal{C}$ such that $s\,\mathcal{C}\,t$ iff
$N_Y(s) = N_Y(t)$, the desired result is immediate.

Proposition 5 $X \twoheadrightarrow Y \mid Z$ iff $\mathcal{C} \preceq \mathcal{E}$. ■
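
Definition 3 translates directly into a check on the sets $N_Y(s)$. The Python sketch below is illustrative only; it assumes tuples are dicts and attribute sets are lists, and the names n_y and mvd_holds as well as the sample relation are hypothetical.

```python
def n_y(rows, s, X, Y, Z):
    """N_Y(s): the Y-values y such that <s.X, y, s.Z> is a tuple of r."""
    return frozenset(
        tuple(t[a] for a in Y) for t in rows
        if all(t[a] == s[a] for a in X + Z)
    )


def mvd_holds(rows, X, Y, Z):
    """X ->> Y | Z holds iff tuples that agree on X have the same N_Y."""
    by_x = {}
    for s in rows:
        key = tuple(s[a] for a in X)
        by_x.setdefault(key, set()).add(n_y(rows, s, X, Y, Z))
    return all(len(ny_sets) == 1 for ny_sets in by_x.values())


courses = [
    {"course": "db", "teacher": "kim", "book": "ullman"},
    {"course": "db", "teacher": "kim", "book": "date"},
    {"course": "db", "teacher": "lee", "book": "ullman"},
    {"course": "db", "teacher": "lee", "book": "date"},
]
print(mvd_holds(courses, ["course"], ["teacher"], ["book"]))      # True
print(mvd_holds(courses[:3], ["course"], ["teacher"], ["book"]))  # False
```

Dropping the last tuple breaks the independence of teachers and books, so the second call reports that the MVD fails.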



5.3 Association Rules 

Association rules (ARs) have been gaining popularity in both data mining and
databases, discussed in (Agrawal, Imielinkski and A.Swami 1993, Agrawal and
R.Srikant 1994, Mannila, H.Toivonen and A.Inkeri Verkamo 1994, Mannila
1997). As a concept, ARs begin with an instance $r$ over $R[A]$, where each
attribute $A_i \in A$ has a boolean domain $\{+, -\}$. For $W \subseteq A$,
we write $W_+$ to mean the set of tuples
$\{t \mid t \in r \wedge t.A_i = +, \text{ for each } A_i \in W\}$. Let
$A = X \cup Y \cup Z$ and without loss of generality let
$X = \{A_1, \ldots, A_k\}$ and $Y = \{A_{k+1}, \ldots, A_m\}$. We use
“$\Rightarrow$” to indicate an AR; in particular the expression
$A_1 \wedge \ldots \wedge A_k \Rightarrow A_{k+1} \wedge \ldots \wedge A_m$
signifies that for $t \in r$, $t \in X_+$ implies there is a tendency that
$t \in Y_+$. This tendency is indicated by a confidence $c$ and support $s$.
The confidence is meant to denote strength and is the ratio
$|XY_+| / |X_+|$. The support provides the overall frequency of the rule and
is the ratio $|XY_+| / |r|$. Rules that have high confidence and strong
support are said to be strong. The task is to discover strong association
rules. We will show later how an association rule can be handled by a local
application of FDs, but first we directly characterize AR in the CE
framework. Because the issue of partition refinement is “value-blind”*, we
must make certain that $r$ contains a tuple with a $+$ in every attribute;
this may be accomplished ad hoc by adding a tag attribute, with a value $-$,
to all tuples of the original $r$ and then adding a tuple with value $+$ for
all attributes, including the tag.

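With this slightly constrained $r$, we move to the CE definitions. The
classifier is $\mathcal{C} = \{XY_+, \overline{XY_+}\}$ and the estimator is
$\mathcal{E} = \{X_+, \overline{X_+}\}$.

Proposition 6 $X \Rightarrow Y$, $c = 1$ iff $\mathcal{C} \preceq \mathcal{E}$.

Proof Suppose $X \Rightarrow Y$, $c = 1$. By the definition of $c$,
$|XY_+| = |X_+|$. Since $XY_+ \subseteq X_+$, $XY_+ = X_+$, and
$\overline{XY_+} = \overline{X_+}$. Now, suppose
$\mathcal{C} \preceq \mathcal{E}$. Since
$\langle +, \ldots, + \rangle \in XY_+ \cap X_+$, this can only hold if
$XY_+ = X_+$. This gives $c = 1$ as required. ■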




An AR is strongly related to an FD, i.e.
$\langle +, \ldots, + \rangle \to \langle +, \ldots, + \rangle$ in the
example above. Using localization of FDs allows exploration of significant
associations where some attributes are negative as well as some positive.
This is important, for example, in medical diagnoses.

Example 2 Suppose each tuple is a diagnostic test that looks for a positive 
reaction helping differentiate characteristics of similar microbiologic genera. 
The tests performed are Gelatinase, Mannitol, Inositol, OF(glucose) (a ‘+’
means a positive test) and whether the organism is toxic (a ‘+’ means the
organism is toxic). On the left is the instance. On the right, rules 1 and 2
are classical association rules, while rules 3 through 5 relate negative as
well as positive characteristics.



[Table: the instance (columns test-id, the tests named above, and Toxic) on
the left and rules 1 through 5 on the right; the cell values and rule bodies
are not recoverable from the source.]

■



Associations actually began with a very different representation. The
characterization of ARs did not begin with the representation used above,
although it is the current vogue, but with a form closer to the typical data
representation. An example in the medical context begins with the
patient-symptom relation, denoted PS, with attributes pid and symptom. This
data indicates the symptom weakness always occurs in the presence of symptom
fatigue, symbolized as fatigue $\rightsquigarrow$ weakness. Then
$y \rightsquigarrow z$ iff
$(\forall p \in PS.pid)[\langle p, y \rangle \in PS \Rightarrow \langle p, z \rangle \in PS]$.
To express this in the CE framework, define the estimator
$\mathcal{E}$ = [symptom] and classifier
$\mathcal{C} = \{C_y^+, C_y^-\}_{y \in symptom}$, where

$C_y^+ = \{\langle p, y \rangle \mid \langle p, y \rangle \in PS \wedge \langle p, weakness \rangle \in PS\}$. (4)

$C_y^- = \{\langle p, y \rangle \mid \langle p, y \rangle \in PS \wedge \langle p, weakness \rangle \notin PS\}$. (5)

This classifier/estimator pair in fact facilitates the determination of all
$y$ such that $y \rightsquigarrow$ weakness. To see this, observe that
$E_y = C_y^+ \cup C_y^-$. $C_y^- \neq \emptyset$ iff there is some pid $p$
such that $\langle p, y \rangle \in PS$ but
$\langle p, weakness \rangle \notin PS$. So, $y \rightsquigarrow$ weakness
iff [symptom]$(y) = C_y^+$.

* This is called genericity in database query theory.

Example 3 Suppose PS exists as the instance on the left. The CE is on
the right. Observe that this instance exhibits fatigue $\rightsquigarrow$
weakness and the trivial weakness $\rightsquigarrow$ weakness as perfect
associations. The association cough $\rightsquigarrow$ weakness, however,
fails to hold since the estimator class, [symptom](cough) = $\{t_5, t_6\}$,
is split into two in the classifier.

       pid  symptom
  t_1   1   fatigue
  t_2   1   weakness
  t_3   2   fatigue     E = {{t_1, t_3}, {t_2, t_4}, {t_5, t_6}}
  t_4   2   weakness    C = {{t_1, t_3}, {t_2, t_4}, {t_5}, {t_6}}
  t_5   2   cough
  t_6   3   cough
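
The classes of equations (4) and (5) can be computed directly from PS. The Python sketch below uses the six tuples of Example 3; the function name perfect_association and the representation of PS as a list of (pid, symptom) pairs are our own assumptions, made only for illustration.

```python
PS = [(1, "fatigue"), (1, "weakness"), (2, "fatigue"),
      (2, "weakness"), (2, "cough"), (3, "cough")]


def perfect_association(ps, y, z):
    """Return (holds, C_plus, C_minus) for the association from symptom y to z.

    C_plus  holds the y-tuples whose pid also shows symptom z,
    C_minus holds the y-tuples whose pid does not.
    The association is perfect exactly when C_minus is empty.
    """
    tuples = set(ps)
    c_plus = {(p, s) for (p, s) in tuples if s == y and (p, z) in tuples}
    c_minus = {(p, s) for (p, s) in tuples if s == y and (p, z) not in tuples}
    return len(c_minus) == 0, c_plus, c_minus


print(perfect_association(PS, "fatigue", "weakness")[0])  # True
print(perfect_association(PS, "cough", "weakness"))
# (False, {(2, 'cough')}, {(3, 'cough')})
```

The cough class is split into two non-empty parts, which is the split of the estimator class described in the example.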



Of course with fixed $y$ and $z$, $y \rightsquigarrow z$ can be transformed
into an AR, viz., symptom = $y$ $\Rightarrow$ symptom = $z$. But this
transformation must create a new Boolean attribute for each value of symptom.
Except by explicit search, ARs cannot find all $z$ such that symptom
$\rightsquigarrow z$, much less the even more general question considered
next. On the other hand, exploration of associations involving several
attributes requires quantification over many variables or, equivalently,
joining the relation with itself, perhaps several times.

Consideration of the more general question of discovering all $y$, $z$ pairs
such that $y \rightsquigarrow z$ shows the first example where the base
relation does not have enough “room” for relevant classifier and estimator
partitions.† It is necessary to build a larger relation, in this case $W$ is
PS × PS.symptom. Then the partitions are
$C_z = \{t \mid t = \langle p, y, z \rangle\}$ and
$E_y = \{t \mid t = \langle p, y, z \rangle\}$. Thus $E_y$ straddles $C_z$
iff it is not the case that $y \rightsquigarrow z$. This example also
clearly shows that the CE framework is a conceptual tool and not a recipe
for implementation, since a single pass through the data will suffice to
compute the associative metrics provided that PS.symptom is of moderate
size. Again, exploration involving several attributes requires wide tuples.

† This is an “input/output complexity” issue.



6 METRICS 

There is a large body of work on metrics for data mining (Silberschatz and 
Tuzhilin 1995, Fayyad and G.Piatetsky-Shapiro 1996b, Fayyad and G.Piatetsky- 
Shapiro 1996c, J.Kivinen and Mannila 1995, Mannila 1996). In the CE frame- 
work, the goodness of an estimator can be decided either quantitatively or 
qualitatively. The CE framework does not itself prefer one metric over an- 
other, but provides a way to describe and evaluate them. 

In both kinds of goodness measure, there is the dimension of breadth and
specificity that enables us to focus the CE to our need. This leads us to an
equivalent analytic definition of indeterminate set,

$\mathfrak{I}_{\mathcal{C}}^{\mathcal{E}} = \bigcup \{E \cap C \mid E \not\subseteq C \text{ for } E \in \mathcal{E}, C \in \mathcal{C}\}$. (6)



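Proposition 7 For two partitions $\mathcal{E}$, $\mathcal{C}$ over $U$,
$I_{\mathcal{C}}^{\mathcal{E}} = \mathfrak{I}_{\mathcal{C}}^{\mathcal{E}}$.

Proof Suppose there is a straddler $E$ in the union forming
$I_{\mathcal{C}}^{\mathcal{E}}$. Then let $C_1, \ldots, C_k$ be classifier
sets that $E$ straddles and $E = \bigcup E \cap C_i$. Since
$E \not\subseteq C_i$, each $E \cap C_i$ is in the union defining
$\mathfrak{I}_{\mathcal{C}}^{\mathcal{E}}$ and thus,
$\mathfrak{I}_{\mathcal{C}}^{\mathcal{E}}$ contains the straddler $E$.
Now, suppose there is an $S$ in the union forming
$\mathfrak{I}_{\mathcal{C}}^{\mathcal{E}}$. Then $S \subseteq E$, where
$E \in \mathcal{E}$ straddles some $C \in \mathcal{C}$. Hence,
$I_{\mathcal{C}}^{\mathcal{E}}$ contains the straddler $E$ and also $S$. ■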
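
Proposition 7 can be checked on a small case. The Python sketch below computes the indeterminate set both ways, as the union of straddlers and by the analytic definition that collects every intersection of an estimator class with a classifier class that does not contain it. The two partitions are those of Example 3 with tuples written as integers; the function names are hypothetical and the code is only a sanity check, not a proof.

```python
def union_of_straddlers(estimator, classifier):
    """Union of the estimator classes that meet more than one classifier class."""
    out = set()
    for E in estimator:
        if sum(1 for C in classifier if E & C) > 1:
            out |= E
    return out


def analytic_indeterminate(estimator, classifier):
    """Union of E & C over all pairs in which E is not contained in C."""
    out = set()
    for E in estimator:
        for C in classifier:
            if not E <= C:
                out |= E & C
    return out


E = [{1, 3}, {2, 4}, {5, 6}]
C = [{1, 3}, {2, 4}, {5}, {6}]
print(union_of_straddlers(E, C))     # {5, 6}
print(analytic_indeterminate(E, C))  # {5, 6}
```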

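Observe that, like $I_{\mathcal{C}}^{\mathcal{E}}$,
$\mathfrak{I}_{\mathcal{C}}^{\mathcal{E}}$ is also well-defined for two
collections of sets $\mathcal{E}$, $\mathcal{C}$. This allows us to examine
$\mathcal{E}$ one column at a time and $\mathcal{C}$ one row at a time,
viz., $\mathfrak{I}_{\mathcal{C}}^{\mathcal{E}} = \bigcup_{E \in \mathcal{E}} \mathfrak{I}_{\mathcal{C}}^{\{E\}} = \bigcup_{C \in \mathcal{C}} \mathfrak{I}_{\{C\}}^{\mathcal{E}}$.
From now on, we write $\mathfrak{I}_{\mathcal{C}}^{E}$ and
$\mathfrak{I}_{C}^{\mathcal{E}}$, understanding they stand for
$\mathfrak{I}_{\mathcal{C}}^{\{E\}}$ and
$\mathfrak{I}_{\{C\}}^{\mathcal{E}}$, respectively. Note that Iq is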
more robust than Iq, since 3c = 3c £ ^^^icss = 0. 

Given an estimator $\mathcal{E} = \{E_1, \ldots, E_m\}$ and classifier
$\mathcal{C} = \{C_1, \ldots, C_n\}$, there is a multitude of possible
measures. Some of those that suggest themselves are

1. Indeterminate count of $C$: $c_{\mathcal{E}}(C) = |\mathfrak{I}_{C}^{\mathcal{E}}|$.
Somewhat local, made wrt a particular $\mathcal{C}$-class $C$.

2. Indeterminate count of $E$: $c_{\mathcal{C}}(E) = |\mathfrak{I}_{\mathcal{C}}^{E}|$.
Somewhat local, made wrt a particular $\mathcal{E}$-class $E$.

3. Indeterminate count at $E$, $C$: $|\mathfrak{I}_{C}^{E}|$. Very local,
made wrt particular $\mathcal{E}$-class and $\mathcal{C}$-class.

4. Subtotal indeterminate count C' C C, C' C C: C{£\V) = Yjcec ^S'{C) = 

5. Determinate precision of C: ds(C) =

6. Indeterminate precision of P: is(C) =

7. Normalized determinate precision of P: ds{C) = • 

8. Total precision: C{£,C) = ]Cc6P^(^)* 

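Each of these metrics can be used to evaluate which of two estimators
is quantitatively better. An estimator $\mathcal{E}$ is qualitatively better
than estimator $\mathcal{E}'$ iff
$I_{\mathcal{C}}^{\mathcal{E}} \subseteq I_{\mathcal{C}}^{\mathcal{E}'}$,
in other words, $\mathcal{E}' \preceq \mathcal{E}$.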



Example 4 (Imperfect ARs) Since we already have a perfect characterization
of ARs, we can relax the confidence to describe ARs that are less than
perfect. For this example we will use the following relation r (left).
Say we are interested in $B \wedge D \Rightarrow A$. Then
$\mathcal{E} = \{\{t_1, t_3, t_4, t_5\}, \{t_2, t_6, t_7\}\}$
and $\mathcal{C} = \{\{t_1, t_4, t_5\}, \{t_2, t_3, t_6, t_7\}\}$. The
support is $|BDA_+| / |r| = 3/7 \approx 43\%$ and the confidence is
$|BDA_+| / |BD_+| = 3/4 = 75\%$. ■
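
The two ratios of Example 4 follow from the class sizes alone. The Python sketch below assumes our reading of the example, in which the estimator classes are $BD_+ = \{t_1, t_3, t_4, t_5\}$ and its complement and the classifier classes are $BDA_+ = \{t_1, t_4, t_5\}$ and its complement; the tuple indices are a reconstruction and the variable names are hypothetical.

```python
estimator = [{1, 3, 4, 5}, {2, 6, 7}]    # BD+ and its complement (assumed)
classifier = [{1, 4, 5}, {2, 3, 6, 7}]   # BDA+ and its complement (assumed)

r_size = sum(len(E) for E in estimator)  # the estimator partitions all of r
bd_plus, bda_plus = estimator[0], classifier[0]

support = len(bda_plus) / r_size
confidence = len(bda_plus) / len(bd_plus)
counterexamples = bd_plus - bda_plus     # tuples with B and D but not A

print(round(support, 2))   # 0.43
print(confidence)          # 0.75
print(counterexamples)     # {3}
```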

7 SUMMARY CONCLUSIONS 

This paper has presented a framework which 

• unifies a variety of data mining and database concepts. 

• establishes both a perfect condition and approximate condition. 

• provides a range of breadth/specificity so important to drill-down data 
mining. 

• supports a wide variety of quantitative and qualitative metrics 



Furthermore, many of the CE constructions are surprisingly simple. It is the
fact that such a variety of topics are handled simply that makes this
framework significant.

We began the paper by drawing parallels with relational database theory.
Continuing this metaphor, we feel we still need to discover the analogs of
SQL, QBE, efficient join algorithms, query optimization strategies, etc. In
addition, we intend to “push the envelope” of the CE framework in such
areas as temporal information, nested relations, and sampling.

One final example suggests a broad range of possibilities. The general issue
involves relationships between the values of attributes. Consider a relation
over X, Y, Z. We are interested in all values t.X such that t.Y > t.Z. This
is in fact an instance of a property, in that the classifier is merely a
partition distinguishing those t for which t.Y > t.Z holds from those for
which it does not. Said another way, the specified problem is a kind of
functional dependency, except that the dependent is derived rather than base
data. The generalization is immediate; by using functions of the attribute
values rather than just the values themselves to define the equivalence
classes, a huge variety of data mining is possible.



Abstract

Associative search is a key function for extracting information from databases. 
In this paper, we present a new associative search method with symbolic and 
semantic operations. This method integrates two kinds of associative search 
functions. The symbolic associative search is a simple pattern-matching-based 
function. This function is used as the information filter which repeatedly exe- 
cutes pattern-matching-based comparisons between data items. The semantic 
associative search function extracts semantically related information by math- 
ematical semantic operations based on the mathematical model of meaning 
which we have proposed. This function provides a context recognition mech- 
anism for extracting semantically related information from databases. This 
mechanism makes it possible to put the semantically related data items in or- 
der, according to the correlation to the searcher’s impression. The integrated 
associative search method realizes advanced information extraction by
combining the symbolic and semantic associative search functions.



1 INTRODUCTION

Databases are now widely distributed across world-wide computer networks.
The main operation for extracting information from databases is associative
search, and symbolic associative search is the most widely used method. It
is difficult, however, for a symbolic function to extract semantically
related information, that is, information that has the same or a similar
meaning but a different representation (Kolodner 1984, Krikelis et al. 1994,
Potter 1992).

A number of associative search methods have been proposed for realiz- 
ing efficient information extraction in database and knowledge base systems 
(David et al. 1982, Krikelis et al. 1994, Potter 1992). In the previously
proposed methods, relationships between data items are represented
explicitly by connection information. The relationships are extracted by
simple pattern matching operations and by following the connection
information (Kolodner 1984). In those methods, from the viewpoint of
semantic representation, the meanings of the data items are fixed, and the
data items are used as information with those fixed meanings. We consider
that the relationships between data items vary in response to situations or
contexts. As a model for dynamically measuring the semantic relationship
between data items while recognizing the context, we have designed a
mathematical model of meaning (Kitagawa et al. 1993, Kiyoki et al. 1994,
Kiyoki et al. 1995).

The symbolic associative search function is effective when the symbolic pat- 
terns of retrieval target data items are unambiguously defined. This function 
is widely used in information extraction in database systems. The semantic 
associative search function by the mathematical model of meaning dynam- 
ically computes semantic correlations between a given context and retrieval 
candidate data items by semantic operations. This function is used for ex- 
tracting semantically related information to a context given by a searcher. 
The main feature of the mathematical model of meaning is that the semantic 
associative search is performed in the orthogonal semantic space. This space 
is created for dynamically computing semantic correlations between the given 
context and retrieval candidate data items. 

In this paper, we present a new associative search method with the in- 
tegrated functions of the symbolic and semantic associative search. In this 
method, appropriate data items are filtered by the pattern-matching-based 
symbolic associative search function and those data items are semantically 
put in order by the semantic associative search function. This method
realizes highly functional information extraction from databases. We have
designed this system as a heterogeneous information processing system in a
multidatabase environment (Bright et al. 1992, Litwin et al. 1990, Sheth et
al. 1990). We have designed an integration function for combining the symbolic
associative search function and the semantic associative search function in the
metalevel architecture. By introducing this function, the functional
integration between symbolic and semantic associative search functions is
realized without modifying the implementations of those functions.

Several information retrieval methods that use an orthogonal space created
by mathematical procedures such as SVD (Singular Value Decomposition) have
been proposed (e.g. the Latent Semantic Indexing method (Deerwester et al.
1990)). Our semantic associative search method is essentially different from
those methods using the SVD. The essential difference is that our method
provides the important function for semantic projections
which realizes the dynamic recognition of contexts. That is, in our method, the
context-dependent interpretation is dynamically performed for computing the 
correlations between a given context and retrieval candidate data items by se- 
lecting a subspace from the entire orthogonal semantic space. In our method, 
the number of phases of the contexts is almost infinite (currently approx- 
imately). Other methods do not provide the context dependent interpretation 
for computing equivalence and similarity in the orthogonal space, that is, the 
phase of meaning is fixed and static in those methods. 

To compute the semantic relationships between data items, several fuzzy
relational database systems have been proposed (Raju et al. 1988,
Rundensteiner 1989). In comparison to the fuzzy relational database systems,
the essential difference of our system is that it eliminates ambiguity in
semantic operations by introducing the concept of context recognition.



The mathematical model of meaning is a new model for realizing the se- 
mantic associative search and extracting semantically related information by 
giving context words. This model can be applied to extract media data items 
by giving the context words which represent the impression and contents of 
the media data items(Kiyoki et al. 1994). 

The mathematical model of meaning consists of:

1) A set of m words is given, and each word is characterized by n features.
That is, an m by n matrix is given as the data matrix.

2) The correlation matrix with respect to the n features is constructed.
Then, the eigenvalue decomposition of the correlation matrix is computed and
the eigenvectors are normalized. The orthogonal semantic space is created as
the span of the eigenvectors which correspond to nonzero eigenvalues.

3) Media data items and context words are characterized by using the
specific features (words) and representing them as vectors. (A sequence of
context words is used to represent a context.)

4) The media data items and context words are mapped into the orthogonal
semantic space by computing the Fourier expansion for the vectors.

5) A set of all the projections from the orthogonal semantic space to the
invariant subspaces (eigen spaces) is defined. Each subspace represents a
phase of meaning, and it corresponds to a context or situation.

6) A subspace of the orthogonal semantic space is selected according to
the user’s impression or the content of media data items, which is given as a
context represented by a sequence of words.

7) The media data item closest to the context, which represents the user’s
impression and the contents of media data items, is extracted in the
selected subspace.

Figure 1 An overview of the associative search process



2 THE ASSOCIATIVE SEARCH PROCESSES 



2.1 Associative search procedure 

The associative procedure consists of two steps as shown in Figure 1. 



Step 1 : Filtering the data items by the symbolic associative search. 

Step 2 : Ordering the selected data items by the semantic associative search. 



To realize these steps, our method provides three basic functions. 

In this method, it is assumed that each data item is identified by an
identifier commonly shared between the symbolic and the semantic associative
search functions (Figure 2). In the symbolic associative search function, the
metadata items of the media data (e.g. media name, authors, created date)
are represented in the form of the relational database. In the semantic
associative search function, vectors corresponding to media data items (media
data vectors) are stored on the semantic space.



2.2 Function-1: symbolic associative search

The best way to perform symbolic associative search is to use a relational
database system. By using the selection operation of the relational database
system, we can select the data items with the same pattern as a keyword and
obtain a set of identifiers of the selected data items.
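
As an illustration only, such a selection could be written with Python's built-in sqlite3 module as follows. The table name media, its columns and its rows are hypothetical examples of the metadata mentioned above; the paper does not prescribe a particular schema or database system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE media (id INTEGER PRIMARY KEY, name TEXT, author TEXT, created TEXT)"
)
conn.executemany(
    "INSERT INTO media (id, name, author, created) VALUES (?, ?, ?, ?)",
    [(1, "sunset", "sato", "1996-05-01"),
     (2, "harbor", "sato", "1997-02-11"),
     (3, "forest", "mori", "1997-03-20")],
)


def symbolic_search(conn, author):
    """Selection on the metadata relation; returns identifiers of matching items."""
    cur = conn.execute("SELECT id FROM media WHERE author = ? ORDER BY id", (author,))
    return [row[0] for row in cur]


print(symbolic_search(conn, "sato"))  # [1, 2]
```

The result is a list of identifiers rather than full tuples, since the identifiers are what the two search functions share.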



2.3 Function-2: semantic associative search 

In the mathematical model of meaning, semantic associative search extracts
the data item with the highest correlation to a given context from a
specific data item set. The procedure is as follows:

Step-1 : Context recognition

Given a sequence of context words for determining the context, the
recognition of the context is performed by the method described in
Section 3.

Step-2 : Ordering of the data items

The data item with the highest correlation to the given context is selected
from the retrieval candidate data item set. This selection is performed
repeatedly, and the data items are put in order according to their
correlations to the given context.



2.4 Function-3: integration of associative search functions 

This integration function integrates the symbolic and semantic associative
search functions indirectly. It acts as a meta-level function through which
the two search functions are combined.

The procedure is as follows: 

Step-1 : Selection by the symbolic associative search: 

The data items in the source relation are filtered by the symbolic asso- 
ciative search function. As the result, the integration function obtains the 
identifiers of the selected data items. The integration function gives the 
identifiers of selected data items to the semantic associative search func- 
tion. 

Step-2 : Ordering by the semantic associative search: 

The semantic associative search function receives the set of the identifiers.
The identifiers are ordered by the semantic associative search function and
the ordered list is returned to the integration function. The integration
function orders and outputs the tuples of the relation according to the
ordered identifiers obtained by the semantic associative search function.

Figure 2 Commonly shared identifiers
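
The two steps can be sketched as a small meta-level function that only passes identifiers between the two search functions. The Python code below is a minimal sketch under our own assumptions: the function integrate, the stand-in search functions by_author and by_score, and the sample data are all hypothetical and serve only to show the flow of identifiers.

```python
def integrate(symbolic_search, semantic_order, relation, keyword, context):
    """Meta-level integration: filter symbolically, then order semantically."""
    ids = symbolic_search(relation, keyword)      # Step-1: selection
    ordered_ids = semantic_order(ids, context)    # Step-2: ordering
    by_id = {row["id"]: row for row in relation}
    return [by_id[i] for i in ordered_ids]


# Hypothetical stand-ins for the two search functions.
relation = [
    {"id": 1, "author": "sato"},
    {"id": 2, "author": "mori"},
    {"id": 3, "author": "sato"},
]
scores = {("calm",): {1: 0.2, 2: 0.9, 3: 0.7}}  # made-up correlation values


def by_author(rel, keyword):
    return [row["id"] for row in rel if row["author"] == keyword]


def by_score(ids, context):
    return sorted(ids, key=lambda i: scores[tuple(context)][i], reverse=True)


result = integrate(by_author, by_score, relation, "sato", ["calm"])
print([row["id"] for row in result])  # [3, 1]
```

Because the search functions are passed in as arguments, neither needs to know about the other, which matches the aim of integrating them without modifying their implementations.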



3 FORMALIZATION OF SEMANTIC ASSOCIATIVE SEARCH 

In this section, we review the mathematical model of meaning (Kitagawa et 
al. 1993, Kiyoki et al. 1994, Kiyoki et al. 1995), which is the basic model for 
the semantic associative search function. 



3.1 Creation of a semantic space 

The semantic associative search is realized by the mathematical model of
meaning (Kitagawa et al. 1993, Kiyoki et al. 1994, Kiyoki et al. 1995) which
we have proposed. For the data items for space creation, a data matrix M
is created. When m data items for space creation are given, each data item
is characterized by n features $(f_1, f_2, \ldots, f_n)$. For given $d_i$
$(i = 1, \ldots, m)$, the data matrix M is defined as the m × n matrix whose
i-th row is $d_i$. Then, each column of the matrix is normalized by the
2-norm in order to create the matrix M.

Figure 3 shows the matrix M. That is, $M = (d_1, d_2, d_3, \ldots, d_m)^T$.



1. The correlation matrix $M^T M$ of M is computed, where $M^T$ represents
the transpose of M.

2. The eigenvalue decomposition of $M^T M$ is computed. The orthogonal
matrix Q is defined by

$Q = (q_1, q_2, \ldots, q_n)^T$,

where $q_i$'s are the normalized eigenvectors of $M^T M$. We call the
eigenvectors “semantic elements” hereafter. Here, all the eigenvalues are
real and all the eigenvectors are mutually orthogonal because the matrix
$M^T M$ is symmetric.

3. Defining the semantic space $\mathcal{MDS}$.

$\mathcal{MDS} := span(q_1, q_2, \ldots, q_\nu)$,

which is a linear space generated by linear combinations of
$\{q_1, \ldots, q_\nu\}$, the eigenvectors that correspond to nonzero
eigenvalues. We note that $\{q_1, \ldots, q_\nu\}$ is an orthonormal basis
of $\mathcal{MDS}$.

Figure 3 Representation of semantic items by matrix M
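
The first part of this construction, column normalization followed by the correlation matrix, can be sketched in plain Python. The code below is illustrative only: the 3 × 3 data matrix is made up, the function names are hypothetical, and the eigenvalue decomposition itself is left to a numerical library (for example numpy.linalg.eigh, which handles symmetric matrices).

```python
import math


def normalize_columns(rows):
    """Scale every column of an m x n matrix to unit 2-norm (zero columns stay zero)."""
    n = len(rows[0])
    norms = [math.sqrt(sum(row[j] ** 2 for row in rows)) for j in range(n)]
    return [[row[j] / norms[j] if norms[j] else 0.0 for j in range(n)] for row in rows]


def correlation_matrix(M):
    """Return the n x n matrix M^T M."""
    n = len(M[0])
    return [[sum(row[i] * row[j] for row in M) for j in range(n)] for i in range(n)]


data = [[1.0, 0.0, 1.0],   # each row is one data item d_i (made-up values)
        [1.0, 1.0, 0.0],
        [0.0, 1.0, -1.0]]
M = normalize_columns(data)
MtM = correlation_matrix(M)

for row in MtM:
    print([round(v, 3) for v in row])
```

The resulting matrix is symmetric with a unit diagonal, which is why its eigenvalues are real and its eigenvectors can be chosen mutually orthogonal.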



3.2 The set of the semantic projections $\Pi_\nu$

The projection $P_{\lambda_i}$ is defined as follows:

$P_{\lambda_i}$: Projection to the eigenspace corresponding to the
eigenvalue $\lambda_i$, i.e. $P_{\lambda_i} : \mathcal{MDS} \to span(q_i)$.

The set of the semantic projections $\Pi_\nu$ is defined as follows:

$\Pi_\nu := \{\, 0,\ P_{\lambda_1}, P_{\lambda_2}, \ldots, P_{\lambda_\nu},$
$\quad P_{\lambda_1} + P_{\lambda_2},\ P_{\lambda_1} + P_{\lambda_3},\ \ldots,\ P_{\lambda_{\nu-1}} + P_{\lambda_\nu},$
$\quad \ldots,$
$\quad P_{\lambda_1} + P_{\lambda_2} + \cdots + P_{\lambda_\nu} \,\}$.

The number of the elements of $\Pi_\nu$ is $2^\nu$, and accordingly it
implies that $2^\nu$ different phases of meaning can be expressed by this
formulation.
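
The count of projections can be confirmed by enumeration, since each projection is determined by the subset of semantic elements it keeps. The short Python sketch below represents a projection by that index subset; this representation and the function name semantic_projections are our own choices for illustration.

```python
from itertools import combinations


def semantic_projections(nu):
    """Each projection is identified by the set of eigenvector indices it keeps."""
    indices = range(1, nu + 1)
    return [frozenset(c) for k in range(nu + 1) for c in combinations(indices, k)]


projections = semantic_projections(3)
print(len(projections))  # 8
```

The empty subset corresponds to the zero projection and the full subset to the sum of all the individual projections.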



3.3 Semantic operator 

The correlations between each context word and each semantic element are
computed by this process. The context word is used to represent the user’s
impression and the contents for media data items to be extracted. When a
sequence

$s_\ell = (u_1, u_2, \ldots, u_\ell)$

of $\ell$ context words and a positive real number $0 < \varepsilon_s < 1$
are given, the semantic operator $S_p$ constitutes a semantic projection
$P_{\varepsilon_s}(s_\ell)$, according to the context. That is,

$S_p : T_\ell \to \Pi_\nu$,

where $T_\ell$ is the set of sequences of $\ell$ words and
$T_\ell \ni s_\ell$, $\Pi_\nu \ni P_{\varepsilon_s}(s_\ell)$. Note that the
set $\{u_1, u_2, \ldots, u_\ell\}$ must be a subset of the words defined in
the matrix M.

The constitution of the operator $S_p$ consists of the following processes:

1. Fourier expansion of $u_i$ $(i = 1, 2, \ldots, \ell)$.

The inner product $u_{ij}$ of $u_i$ and $q_j$ is computed, i.e.

$u_{ij} := (u_i, q_j)$, for $j = 1, 2, \ldots, \nu$.

We define $\hat{u}_i \in I^\nu$ as

$\hat{u}_i := (u_{i1}, u_{i2}, \ldots, u_{i\nu})$.

This is the mapping of the context word $u_i$ to the semantic space
$\mathcal{MDS}$.

2. Computing the semantic center $G^+(s_\ell)$ of the sequence $s_\ell$.

$G^+(s_\ell) := \frac{\left(\sum_{i=1}^{\ell} u_{i1}, \ldots, \sum_{i=1}^{\ell} u_{i\nu}\right)}{\left\| \left(\sum_{i=1}^{\ell} u_{i1}, \ldots, \sum_{i=1}^{\ell} u_{i\nu}\right) \right\|_\infty}$,

where $\| \cdot \|_\infty$ denotes infinity norm.

3. Determining the semantic projection $P_{\varepsilon_s}(s_\ell)$.

If the sum for a semantic element is greater than a given threshold
$\varepsilon_s$, we employ the semantic element to form the projected
semantic subspace. We define the semantic projection by the sum of such
projections.

$P_{\varepsilon_s}(s_\ell) := \sum_{i \in \Lambda_{\varepsilon_s}} P_{\lambda_i} \in \Pi_\nu$,

where $\Lambda_{\varepsilon_s} := \{\, i \mid |(G^+(s_\ell))_i| > \varepsilon_s \,\}$.
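
The three processes can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the semantic center is taken to be the vector of summed Fourier coefficients scaled by its infinity norm, the orthonormal basis is simply the standard basis of a three-dimensional space, the context vectors are made up, and indices are 0-based. None of the function names come from the paper.

```python
def fourier_coefficients(u, basis):
    """Inner products of a word vector u with every semantic element q_j."""
    return [sum(a * b for a, b in zip(u, q)) for q in basis]


def semantic_center(context_vectors, basis):
    """Sum the coefficients per semantic element and scale by the infinity norm."""
    coeffs = [fourier_coefficients(u, basis) for u in context_vectors]
    sums = [sum(col) for col in zip(*coeffs)]
    scale = max(abs(v) for v in sums)
    return [v / scale for v in sums] if scale else sums


def select_subspace(center, eps):
    """Indices of the semantic elements whose center component exceeds eps."""
    return [j for j, g in enumerate(center) if abs(g) > eps]


basis = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]  # assumed orthonormal
context = [(0.8, 0.1, -0.2), (0.6, 0.3, 0.0)]                # two context words
center = semantic_center(context, basis)

print(select_subspace(center, 0.5))  # [0]
print(select_subspace(center, 0.2))  # [0, 1]
```

Lowering the threshold admits more semantic elements into the selected subspace, so the threshold controls how narrowly the context is interpreted.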



3.4 The creation method of metadata for media data items

The media data item (e.g. image data item) P consists of t objects (or
impression words) $o_1, o_2, \ldots, o_t$, where each object is defined as
an n dimensional vector:

$o_i = (o_{i1}, o_{i2}, \ldots, o_{in})$,

which is characterized by n specific features.

Namely, we define the media data item P as the collection of t objects (or
impression words).

$P = \{o_1, o_2, \ldots, o_t\}$.

Moreover, we define the operator union $\bigoplus$ of objects
$o_1, o_2, \ldots, o_t$, to represent the metadata for the media data item
P as a vector as follows:

$\bigoplus_{i=1}^{t} o_i := \left( sign(o_{l_1 1}) \max_{1 \le i \le t} |o_{i1}|,\ sign(o_{l_2 2}) \max_{1 \le i \le t} |o_{i2}|,\ \ldots,\ sign(o_{l_n n}) \max_{1 \le i \le t} |o_{in}| \right)$,

where sign(a) represents the sign (plus or minus) of “a” and $l_k$,
$k = 1, \ldots, n$, represents the index which gives the maximum, that is:

$\max_{1 \le i \le t} |o_{ik}| = |o_{l_k k}|$.
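
In other words, for every feature the union operator keeps the component of largest magnitude together with its sign. A minimal Python sketch follows; the function name union_operator and the two sample objects are hypothetical, and ties in magnitude are resolved in favor of the first object.

```python
def union_operator(objects):
    """Per feature, keep the component of largest magnitude together with its sign."""
    return [max(column, key=abs) for column in zip(*objects)]


objects = [(0.2, -0.9, 0.0),   # o_1
           (-0.5, 0.4, 0.3)]   # o_2
print(union_operator(objects))  # [-0.5, -0.9, 0.3]
```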



3.5 The expression for semantic associative search 

We introduce an expression to measure the correlation between the context 
words (keywords) and the media data items. This expression measures the 
correlation between a set of context words and each retrieval candidate media 
data item. 

We can regard the set of context words as the words forming the context
$s_\ell$. We can specify semantic subspaces with weights $c_j$'s. Since the
norm of a media data item, which can be calculated from the metadata of the
media data item, reflects the correlation between the media data item and
the semantic elements included in each selected subspace, we may use it as
the measure for the correlation between the given context and each media
data item.

The expression $\hat{\eta}_0(x; s_\ell)$ for computing the norm of a media
data item, in which we eliminate the effect of the negative correlation by
omitting the corresponding terms, is defined as follows:

$\hat{\eta}_0(x; s_\ell) = \frac{\sqrt{\sum_{j \in \Lambda_{\varepsilon_s} \cap S} \{c_j(s_\ell)\, x_j\}^2}}{\|x\|_2}$,

where the set S is defined by
$S = \{\, i \mid sign(c_i(s_\ell)) = sign(x_i) \,\}$ and the weight
$c_j(s_\ell)$ is given as follows:

$c_j(s_\ell) = \frac{\sum_{i=1}^{\ell} u_{ij}}{\left\| \left(\sum_{i=1}^{\ell} u_{i1}, \ldots, \sum_{i=1}^{\ell} u_{i\nu}\right) \right\|_\infty}$, $j \in \Lambda_{\varepsilon_s}$.



3.6 The semantic associative search algorithm 

The semantic associative search function realizes context-dependent
interpretation. This function performs the selection of the semantic subspace from the
semantic space ($\mathcal{MDS}$). When a sequence $s_\ell$ of context words for
determining a context is given to the system, the selection of the semantic subspace
is performed. This selection corresponds to the recognition of the context,
which is defined by the given context words. The selected semantic subspace
corresponds to the given context. In the selected semantic subspace, the media
data item with the highest correlation to the given context is obtained by the
expression $\eta_0(x; s_\ell)$ defined in Section 3.5.

This semantic associative search is performed by the following procedure:

Step-1 When a sequence $s_\ell$ of the context words for determining a context
(representing the user’s impression and the contents of media data item)
is given, the Fourier expansion is computed for each context word, and the
Fourier coefficients of these words with respect to each semantic element are
obtained. This corresponds to seeking the correlation between each context
word and each semantic element.

Step-2 The values of the Fourier coefficients for each semantic element are
summed up to find the correlation between the given context words and
each semantic element.

Step-3 If the sum obtained in Step-2 in terms of each semantic element is
greater than a given threshold $\varepsilon_s$, the semantic element is employed to form
the semantic subspace $P_{\varepsilon_s}(s_\ell)\mathcal{MDS}$. This corresponds to the recognition
of the context.

Step-4 By using the expression $\eta_0(x; s_\ell)$, the metadata item for the media
data item with the highest correlation to the context is selected among the
candidate metadata items for the media data set in the selected semantic
subspace. This corresponds to finding the media data item with the highest
correlation to the given context.
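The four steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' implementation: the semantic elements are assumed to be given as orthonormal basis vectors, so that a Fourier coefficient is an inner product; the weights are assumed to be the summed coefficients scaled by their largest magnitude; and the correlation is assumed to be a weighted 2-norm over the selected elements whose weight and coordinate agree in sign, divided by the 2-norm of the item. All function names are hypothetical.

```python
import math


def fourier_coefficients(vector, basis):
    """Inner products of a vector with each semantic element (basis vector)."""
    return [sum(v * q for v, q in zip(vector, element)) for element in basis]


def select_subspace(context_words, basis, threshold):
    """Steps 1-3: sum the coefficients per semantic element, then threshold."""
    sums = [0.0] * len(basis)
    for word in context_words:                                   # Step-1
        for j, coeff in enumerate(fourier_coefficients(word, basis)):
            sums[j] += coeff                                     # Step-2
    # Assumption: scale the sums so that the largest magnitude becomes 1.
    scale = max((abs(s) for s in sums), default=0.0) or 1.0
    weights = [s / scale for s in sums]
    selected = [j for j, w in enumerate(weights) if w > threshold]  # Step-3
    return weights, selected


def correlation(item, basis, weights, selected):
    """Assumed form of the norm: terms with disagreeing signs are omitted."""
    x = fourier_coefficients(item, basis)
    length = math.sqrt(sum(v * v for v in x))
    if length == 0.0:
        return 0.0
    kept = sum((weights[j] * x[j]) ** 2
               for j in selected if weights[j] * x[j] > 0)
    return math.sqrt(kept) / length


def semantic_search(context_words, items, basis, threshold, max_results):
    """Step-4: rank the candidate items by correlation with the context."""
    weights, selected = select_subspace(context_words, basis, threshold)
    scored = [(correlation(vec, basis, weights, selected), item_id)
              for item_id, vec in items.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(item_id, score) for score, item_id in scored[:max_results]]


# Toy data: a three-element semantic space and four candidate items.
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
context = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]
items = {"a": [1, 0, 0], "b": [0, 1, 0], "c": [-1, 0, 0], "d": [1, 1, 0]}
result = semantic_search(context, items, basis, 0.5, 2)
```

In this toy run only the first semantic element passes the threshold, so item "a" ranks first, "d" second, and "c" scores zero because its coordinate has the opposite sign to the context weight.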



4 DATA STRUCTURES AND PRIMITIVE OPERATORS 

In this section, we describe the data structures and the primitive operators 
for implementing our method. 



4.1 Data structures and primitive operators in the 
symbolic associative search function 

In the symbolic associative search function, the data structure and several
primitive operators of the relational database system are used. We use the
“relation” as the data structure.

We define the set of primitive operators in the symbolic associative search
as follows:

• (select [rel] [att] [cond] [val])

• (project [rel] [att-list])

• (join [rel1] [att1] [rel2] [att2] [cond])

• (union [rel1] [rel2])

• (diff [rel1] [rel2])

(The parameters “rel, rel1, rel2” : relations, “att, att1, att2” : attributes,
“att-list” : a list of attributes, “cond” : a condition, and “val” :
a keyword value.)
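To make the five primitives concrete, here is a minimal Python sketch in which a relation is assumed to be a list of dictionaries (one per tuple). The representation, the `CONDITIONS` table and the function bodies are illustrative assumptions; in the paper the primitives are executed by a relational DBMS.

```python
import operator

# Assumed mapping from condition symbols to comparison functions.
CONDITIONS = {"==": operator.eq, "!=": operator.ne, "<": operator.lt,
              "<=": operator.le, ">": operator.gt, ">=": operator.ge}


def select(rel, att, cond, val):
    """Keep the tuples whose attribute `att` satisfies `cond` against `val`."""
    test = CONDITIONS[cond]
    return [row for row in rel if test(row[att], val)]


def project(rel, att_list):
    """Keep only the listed attributes, dropping duplicate tuples."""
    out = []
    for row in rel:
        reduced = {att: row[att] for att in att_list}
        if reduced not in out:
            out.append(reduced)
    return out


def join(rel1, att1, rel2, att2, cond):
    """Combine the tuple pairs for which att1 and att2 satisfy `cond`."""
    test = CONDITIONS[cond]
    return [{**r1, **r2} for r1 in rel1 for r2 in rel2
            if test(r1[att1], r2[att2])]


def union(rel1, rel2):
    """Tuples that appear in either relation, without duplicates."""
    out = []
    for row in list(rel1) + list(rel2):
        if row not in out:
            out.append(row)
    return out


def diff(rel1, rel2):
    """Tuples of rel1 that do not appear in rel2."""
    return [row for row in rel1 if row not in rel2]


images = [
    {"ID": "hokusai1", "Author": "Hokusai", "Year": 1829},
    {"ID": "hiro1", "Author": "Hiro", "Year": 1990},
    {"ID": "nelson1", "Author": "Nelson", "Year": 1977},
]
recent = project(select(images, "Year", ">", 1950), ["ID"])
```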



4.2 Data structures and primitive operators of the 
semantic associative search function 

In this section, we describe the data structure and the primitive operator of
the semantic associative search function.

In the semantic associative search function, a vector corresponds to a media
data item. The primitive operator of semantic associative search takes a context,
a set of vectors of retrieval candidate media data items, and the number of
return values as parameters.

We define the primitive operator as follows: 

• (semantic-search [context] [target] [maxresult] ) 

(The parameter “context” : a context, “target” : a set of retrieval candi- 
date vectors, and “maxresult” : the number of return values.) 



4.3 Data structures and primitive operators in the 
associative search method 

In this section, we describe the data structure and the primitive operator in 
the integrated system. 

As the data structure, we use “relation” in the symbolic associative search
function and a set of vectors of retrieval candidate media data in the semantic
associative search function. We assume that common identifiers are
shared so that the two functions can recognize the same objects.

The primitive operator of the integrated system is a higher-level operator
built on the semantic associative search function of the previous section. The
parameters of the primitive operator are a relation, a context, a set of vectors
of retrieval target media data, a semantic space for semantic associative
search, the number of results n, and the names of the attributes to be added.
The return value of the primitive operator is the input relation extended with
these attributes, which hold the ranking and the norms.

• (search-mediadata-by-context [rel] [ID]
      [att1] [att2] [space] [target] [user]
      [maxresult] [context])

(The parameters “rel” : an input relation, “ID” : an attribute name of
the identifier, “att1, att2” : names of the attributes, “space” : a semantic
space, “target” : a set of vectors of retrieval target media data, “user” : a
user’s dictionary, “maxresult” : the number of results, and “context” : a
sequence of context words.)

Figure 4 The implementation of the experimental system

The procedure of this primitive operator is as follows: 



Step-1 Measures correlations in the semantic space by Function-2 of Section
2.3 with a given context, a set of vectors of retrieval candidate media data,
a semantic space, and the number of results n.

Step-2 Orders the media data items according to the correlation, and obtains
the top n data items (the ranking and norms of the n vectors are obtained).

Step-3 Adds the attributes for the ranking and norms to the input relation.

Step-4 Inserts the attribute values of the ranking and norms into the input
relation.
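The procedure above can be sketched in Python as follows. To keep the example self-contained, the parameters [space], [target], [user] and [context] are replaced by a single hypothetical callable `score` that returns the correlation for an identifier; this substitution, the list-of-dictionaries relation and the function body are assumptions made for illustration. The norm values in the usage example are those reported in Figure 8 and serve only as stand-ins for computed correlations.

```python
def search_mediadata_by_context(rel, id_att, rank_att, norm_att, score, maxresult):
    """Return the top `maxresult` tuples of `rel`, extended with rank and norm.

    rel       -- input relation, assumed to be a list of dictionaries
    id_att    -- name of the identifier attribute
    rank_att  -- name of the attribute that receives the ranking
    norm_att  -- name of the attribute that receives the norm
    score     -- hypothetical callable: identifier -> correlation (Step-1)
    """
    # Step-2: order the items by correlation and keep the top n.
    scored = sorted(((score(row[id_att]), row[id_att]) for row in rel),
                    key=lambda pair: (-pair[0], pair[1]))
    top = scored[:maxresult]
    by_id = {row[id_att]: row for row in rel}
    result = []
    for rank, (norm, ident) in enumerate(top, start=1):
        extended = dict(by_id[ident])   # Step-3: add the new attributes
        extended[rank_att] = rank       # Step-4: insert their values
        extended[norm_att] = norm
        result.append(extended)
    return result


hokusai = [{"ID": "hokusai1"}, {"ID": "hokusai2"},
           {"ID": "hokusai3"}, {"ID": "hokusai4"}]
norms = {"hokusai1": 0.291638, "hokusai2": 0.231356,
         "hokusai3": 0.135075, "hokusai4": 0.221350}
ranked = search_mediadata_by_context(hokusai, "ID", "rank", "norm", norms.get, 4)
```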



5 IMPLEMENTATION OF THE SYSTEM 

In this section, we present the implementation of the proposed associative
search method.

We have implemented the associative search method as a system consisting
of three modules (Figure 4): the symbolic associative search subsystem, the
semantic associative search subsystem, and the meta-level system.



5.1 Implementation of the symbolic associative search
subsystem

We have implemented the symbolic associative search (pattern matching)
method by using the relational database system. This subsystem is implemented
with the primitive processor and the database management system
(DBMS), as shown in Figure 4.

The primitive processor is the mediator between the meta-level system and
the DBMS. Through uniform access via primitives, our system realizes the
abstraction of multiple DBMSs. Relations are used as the data structure.



5.2 Implementation of semantic associative search 
subsystem 

We have implemented an experimental system of the semantic associative
search function. To create a data matrix M automatically, we have referred
to the English dictionary “General Basic English Dictionary (Ogden
1940)”, in which only 871 basic words are used to explain every English
vocabulary entry. These basic words are used as the features corresponding to
the columns of the data matrix M; that is, 871 features are provided to make
the semantic space. In addition, 2115 words are used as the words
corresponding to the rows of the data matrix M. These 2115 words have been
selected as the basic vocabulary entries; they are the same as the basic
explanatory words used in the English dictionary “Longman Dictionary of
Contemporary English (Longman 1987).” The resulting 2115 × 871 data matrix
is used to create the semantic space.

Context words and retrieval candidate media data items are mapped into this
space. Furthermore, each basic word corresponding to a vocabulary entry is
mapped to the semantic space by the Fourier expansion. The procedure for
creating the semantic space is as follows:

1. Each of the 2115 vocabulary entries corresponds to a row of the matrix M.
In setting a row of the matrix M, each column corresponding to an
explanatory word (feature) which appears in the vocabulary entry
is set to the value “1”. If the explanatory word is used with a negative
meaning, the column corresponding to the word (feature) is set to the
value “-1”. The column corresponding to the vocabulary entry itself is set
to the value “1”, and the other columns are set to the value “0”. This
process is performed for every vocabulary entry. Then, each column of
the matrix is normalized by the 2-norm to create the matrix M.

2. By using this matrix M, a semantic space is computed as described in
Section 3.

To create the data matrix M from the dictionary automatically, we have
implemented several filters which remove unnecessary words, such as articles
and pronouns, and transform conjugations and inflections of words to the
infinitives. The unnecessary words are not used as features in the data matrix
M.

Each English word is mapped into this semantic space. We used the simple
words which appear in the dictionary itself. We have performed several
experiments using this semantic space to clarify the effectiveness of our method.
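Step 1 of the matrix-construction procedure in this section can be sketched in Python as follows. The toy vocabulary, the input format (each entry mapped to its positively and negatively used explanatory words) and the function name are assumptions made for illustration; the actual system derives this information from the dictionary through the filters described above.

```python
import math


def build_data_matrix(entries, features):
    """Build the data matrix M with one row per entry, one column per feature.

    entries  -- dict: vocabulary entry -> (positive words, negative words)
    features -- list of basic words used as columns
    """
    column = {feature: j for j, feature in enumerate(features)}
    matrix = []
    for word, (positive, negative) in entries.items():
        row = [0.0] * len(features)          # other columns stay at 0
        for feature in positive:
            row[column[feature]] = 1.0       # explanatory word
        for feature in negative:
            row[column[feature]] = -1.0      # word used with negative meaning
        if word in column:
            row[column[word]] = 1.0          # the entry itself
        matrix.append(row)
    # Normalize each column by its 2-norm (all-zero columns are left as is).
    for j in range(len(features)):
        norm = math.sqrt(sum(row[j] ** 2 for row in matrix))
        if norm > 0.0:
            for row in matrix:
                row[j] /= norm
    return matrix


# Hypothetical three-feature vocabulary, for illustration only.
features = ["light", "dark", "sound"]
entries = {
    "bright": (["light"], ["dark"]),
    "light": ([], ["dark"]),
    "noise": (["sound"], []),
}
M = build_data_matrix(entries, features)
```

With this toy input, the first two rows both receive 1 in the "light" column and -1 in the "dark" column before normalization, so each of those columns ends up with entries of magnitude 1/√2.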



5.3 Implementation of the meta-level system 

We have implemented the meta-level system consisting of the query interpreter
and the query processor (Figure 4). The query interpreter receives a searcher’s
(user’s) query and translates it into a sequence of primitives. It then
sends the sequence to the query processor. The query processor distributes
the primitives to the subsystems.

The query processor is currently implemented on UniSQL (UniSQL
1995), a relational database system extended with object-oriented
features.



6 EXPERIMENTS 

We performed several experiments to clarify the feasibility of our associative
search method. These experiments indicate that the method based on symbolic
filtering and semantic ordering realizes advanced associative search for
media data.



6.1 Experimental environment

We have implemented the experimental system in C and ESQL/X
(UniSQL 1995). As the platform, we have used Sun SparcStation 5
and Sun SparcStation EC (SunOS 4.1.4).



6.2 Experiment-1 

In this experiment, 30 image data items are used as media data. The
image database is named “famousimages” and is listed in Figure 5.
The metadata (impression words) of these images are given in Figure
6. The query used in Experiment-1 is shown in Figure 7. The data items with
“Hokusai” as the author are selected by the selection operation. Then,
those data items are ordered according to the given context “power, force”
by the semantic associative search subsystem. This query means “Select
Hokusai’s images, and put them in order according to the context ‘power’
and ‘force’.” The results are listed in Figure 8.



6.3 Experiment-2 

The query for Experiment-2 is shown in Figure 9. This query means “Select
images which were painted after 1950, and put them in order according to the
context ‘light’ and ‘bright’.” By the selection operation, the images painted
after 1950 are selected, and then they are ordered according to the context
“light, bright” by the semantic associative search function. The results are
listed in Figure 10.



6.4 Experiment-3 

As Experiment-3, the query shown in Figure 11 is issued to the system. This
query means “Select the images which were painted in or after 1990, and put
them in order according to the context ‘light’ and ‘bright’.” The images painted
in or after 1990 are selected and ordered according to the context “light, bright.”
The results are shown in Figure 12.

After the selection by pattern matching, the data items are put in order
according to the given context by the semantic associative search. These
experiments have clarified the feasibility and the advantage of our associative
search method with symbolic filtering and semantic ordering functions.
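The symbolic filtering step of the three experiments can be checked against the database of Figure 5 with a short Python sketch. The (ID, Author, Year) triples below are transcribed from Figure 5 with the titles omitted; the helper `select_ids` is hypothetical and stands in for the select primitive. The sketch covers only the selection, not the semantic ordering.

```python
# (ID, Author, Year) triples transcribed from Figure 5; titles omitted.
FAMOUSIMAGES = [
    ("chagall1", "Chagall", 1910), ("chagall2", "Chagall", 1917),
    ("chagall3", "Chagall", 1917), ("chagall4", "Chagall", 1917),
    ("corot1", "Corot", 1834), ("corot2", "Corot", 1851),
    ("corot3", "Corot", 1830), ("corot4", "Corot", 1846),
    ("gogh1", "Gogh", 1888), ("gogh2", "Gogh", 1888),
    ("hiro1", "Hiro", 1990), ("hiro2", "Hiro", 1991),
    ("hiro3", "Hiro", 1992), ("hiro4", "Hiro", 1990),
    ("hiro5", "Hiro", 1990),
    ("hokusai1", "Hokusai", 1829), ("hokusai2", "Hokusai", 1829),
    ("hokusai3", "Hokusai", 1829), ("hokusai4", "Hokusai", 1829),
    ("loirand1", "Loirand", 1990), ("loirand2", "Loirand", 1990),
    ("loirand3", "Loirand", 1990), ("loirand4", "Loirand", 1990),
    ("nelson1", "Nelson", 1977), ("nelson2", "Nelson", 1980),
    ("renoir1", "Renoir", 1881), ("renoir2", "Renoir", 1876),
    ("renoir3", "Renoir", 1881), ("renoir4", "Renoir", 1882),
    ("sarthou1", "Sarthou", 1980),
]


def select_ids(predicate):
    """Identifiers of the images whose (author, year) satisfy the predicate."""
    return {ident for ident, author, year in FAMOUSIMAGES
            if predicate(author, year)}


experiment1 = select_ids(lambda author, year: author == "Hokusai")
experiment2 = select_ids(lambda author, year: year > 1950)
experiment3 = select_ids(lambda author, year: year >= 1990)
print(len(experiment1), len(experiment2), len(experiment3))  # 4 12 9
```

The three set sizes agree with the number of rows in Figures 8, 10 and 12, and the items selected in Experiment-2 but not in Experiment-3 are exactly nelson1, nelson2 and sarthou1.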



7 CONCLUSION 

In this paper, we have proposed a new associative search method with
symbolic filtering and semantic ordering functions. The symbolic associative
search is based on a simple pattern-matching function. This function
is used as the information filter which repeatedly executes
pattern-matching-based comparisons between data items. The semantic associative
search function extracts semantically related information by mathematical
semantic operations based on the mathematical model of meaning.

We have also presented an implementation method for the associative search
method in the multidatabase environment. This implementation method makes
it possible to integrate the existing subsystems for symbolic and semantic
associative search under the meta-level system.

As our future work, we will realize a learning mechanism for adapting the
information extraction according to individual variation. We will also consider
analytical evaluation of our associative search method. Furthermore, we will
study automatic metadata creation (Kashyap et al. 1996) from media data,
for designing an advanced multimedia database system.
ID          Title               Author    Year

chagall1    An_Atelier          Chagall   1910
chagall2    A_Grey_House        Chagall   1917
chagall3    A_Gate              Chagall   1917
chagall4    A_Window            Chagall   1917
corot1      Genor_City          Corot     1834
corot2      La_Lochell          Corot     1851
corot3      La_Chelvalla        Corot     1830
corot4      Wood                Corot     1846
gogh1       Dovinie             Gogh      1888
gogh2       Valley              Gogh      1888
hiro1       Cereta              Hiro      1990
hiro2       A_Street            Hiro      1991
hiro3       Concord             Hiro      1992
hiro4       Louvre              Hiro      1990
hiro5       Venice              Hiro      1990
hokusai1    Kanagawa            Hokusai   1829
hokusai2    Tago_no_ura         Hokusai   1829
hokusai3    Misaka              Hokusai   1829
hokusai4    Sekiya              Hokusai   1829
loirand1    The_Harbour         Loirand   1990
loirand2    Near_Broue          Loirand   1990
loirand3    A_Farm              Loirand   1990
loirand4    A_Stone_Pavement    Loirand   1990
nelson1     pursuit             Nelson    1977
nelson2     whaling-ships       Nelson    1980
renoir1     Venise_brouillard   Renoir    1881
renoir2     Garden_at_Cortot    Renoir    1876
renoir3     La_Casbah           Renoir    1881
renoir4     Estaque             Renoir    1882
sarthou1    Sea                 Sarthou   1980

Figure 5 The database used in experiments (famousimages)



ID          features

chagall1    vivid quiet substance
chagall2    grief terrible sombre
chagall3    sober dynamic motion
chagall4    shine tender calm
corot1      beautiful grand calm
corot2      beautiful delicate calm
corot3      grief sombre sober
corot4      shine beautiful calm
gogh1       merry delight shine
gogh2       grief terrible sombre
hiro1       twilight grand quiet
hiro2       cheer dim quiet
hiro3       beautiful quiet calm
hiro4       fine shine beautiful
hiro5       fine beautiful calm
hokusai1    dynamic strong motion
hokusai2    fight motion calm
hokusai3    delicate calm quiet
hokusai4    vivid motion speed
loirand1    shine grand calm
loirand2    delight shine calm
loirand3    delight grand calm
loirand4    quiet substance material
nelson1     grand dynamic motion
nelson2     twilight calm quiet
renoir1     dim tender quiet
renoir2     delight dim calm
renoir3     loud bustle crowd
renoir4     fine strong quiet
sarthou1    dynamic motion speed

Figure 6 Metadata of images in experiments



(project
    (search-mediadata-by-context
        (select famousimages 'Author '== 'Hokusai)
        ID 'rank 'norm space famousimages user
        4 '(power force))
    '(Rank ID norm)
)

Figure 7 The query for Experiment-1



Rank    ID          norm

1       hokusai1    0.291638
2       hokusai2    0.231356
3       hokusai4    0.221350
4       hokusai3    0.135075

Figure 8 The result of Experiment-1



(project
    (search-mediadata-by-context
        (select famousimages 'Year '> '1960)
        ID 'rank 'norm space famousimages user
        30 '(light bright))
    '(Rank ID norm)
)

Figure 9 The query for Experiment-2



Rank    ID          norm

1       hiro4       0.340773
2       loirand1    0.331999
3       loirand2    0.331126
4       hiro1       0.286002
5       nelson2     0.267768
6       loirand4    0.249276
7       hiro3       0.231995
8       sarthou1    0.223989
9       nelson1     0.223900
10      hiro5       0.206538
11      hiro2       0.205272
12      loirand3    0.204344

Figure 10 The result of Experiment-2



(project
    (search-mediadata-by-context
        (select famousimages 'Year '>= '1990)
        ID 'rank 'norm space famousimages user
        30 '(light bright))
    '(Rank ID norm)
)

Figure 11 The query for Experiment-3



Rank    ID          norm

1       hiro4       0.340773
2       loirand1    0.331999
3       loirand2    0.331126
4       hiro1       0.286002
5       loirand4    0.249276
6       hiro3       0.231995
7       hiro5       0.206538
8       hiro2       0.205272
9       loirand3    0.204344

Figure 12 The result of Experiment-3
Abstract

While database reverse engineering is maturing, recovering the
semantics of recent OO applications seems to attract little interest. The reason is
that the problem is overlooked: OO programs are supposed to be written in a clean
and disciplined way, and to be based on state-of-the-art technologies which allow
programmers to write self-documenting code. This paper is an attempt to explain
why reality is far from this naive vision. Mainly through a small C++ case
study, it puts forward the main problems that occur when trying to understand
actual OO applications. The example is processed through a generic reverse
engineering methodology which applies successfully to OO programs, thanks to
logical and conceptual OO models that can precisely describe object structures at
any level of abstraction. As a synthesis of this case study, the paper discusses the
techniques and tool support that are needed to help analysts in reverse engineering
the object structures of OO applications.

Keywords

database, data reverse engineering, methodology, object-oriented applications,
object-oriented specification, semantics elicitation

1. INTRODUCTION

While database reverse engineering is maturing, as witnessed by the
increasing number of conferences on the topic, recovering the semantics of
recent OO applications seems to attract little interest. The reason is that the
problem is overlooked: OO programs are supposed to be written in a
clean and disciplined way, and to be based on state-of-the-art technologies which
allow programmers to write code that is self-documenting and easy to understand
and maintain. It quickly appears that reality is far from this naive vision, as we will
argue in this paper.

So far, the term OO reverse engineering has been given two distinct
interpretations, namely building an OO description of a standard application and
building/recovering an OO description of an OO application. The objective of this
paper is to contribute to solving the second kind of problem. However, it is
worth discussing the goals and problems of both approaches, since they share more
than one might think at first glance.

1.1 Building an OO description of a non-OO application 

According to the first interpretation, a standard (typically 3GL) application is
analyzed in order to build an OO description of its data objects and of as many
parts of its procedural components as possible. A typical reverse
engineering project following this approach consists in finding potential object
classes and their basic methods. For example, a COBOL business application
based on files CUSTOMER, ITEM and ORDER will be given a description
comprising Customer, Items and Order classes, with their associated methods such
as RegisterCustomer, DropCustomer, ChangeAddress, SendInvoice, etc. The
initial idea is quite simple (Sneed, 1996):

• each record type implements an object class and each record field represents a
class attribute;

• the creation, destruction and updating methods (e.g. RegisterCustomer,
DropCustomer, ChangeAddress) can be discovered by extracting and reordering
the procedural sections that manage the source records;

• the application methods (e.g. SendInvoice) can be extracted by searching the
code for the functional modules.

This idea has been supported by much research effort in recent years (Gall,
1995), (Sneed, 1995), (Yeh, 1995), (Newcomb, 1995). Unfortunately, it proved
much more difficult to implement than originally expected. Indeed, the process of
code analysis must take into account complex patterns such as near-duplication
(near-identical code sections duplicated throughout the programs (Baker, 1995)),
interleaving (a single code section used by several execution flows (Rugaber,
1995)) and runtime-determined control structures (e.g. dynamically changing the
target of a goto statement or dynamic SQL). Some authors even propose, in some
situations, to leave the code aside, and to reuse the data only (Sneed, 1996b),
among others through wrapping techniques based on the CORBA model. In
addition, many extracted modules appear to cope with the management of several
record types, i.e. with more than one potential object class. This latter problem
forces the analyst to make arbitrary choices, to deeply restructure the code, or to
resort to some heuristics (Penteado, 1996).

Several code analysis techniques have been proposed to examine the static and
dynamic relationships between statements and data structures. Dataflow graphs,
dependency graphs and program slicing are among the most popular. In particular,
the concept of program slicing, introduced by M. Weiser (Weiser, 1984), seems to
be the ultimate solution to locate the potential methods of the record/classes of a
program.

1.2 Building an OO description of an OO application 

The second interpretation can be read as reverse engineering of OO applications.
Quite naturally, the result will be expressed as OO specifications. The problem is
of course different, and fortunately a bit easier, though many standard problems
will have to be coped with.

The basic idea is straightforward: the class definitions are parsed and abstracted
in a higher-level, generally graphical, OO model*. The schema that is obtained in
this way comprises object classes with attributes, inheritance hierarchies and
methods. Unfortunately, this schema is far from satisfying. Indeed, most OO
applications are written with languages that are low-level as far as object
semantics is concerned. For instance, Smalltalk, C++ and current OO-DBMS lack many
important constructs and integrity constraints that would be necessary to express
essential properties of the application domain to be described. Let us mention four
of them.

* Many tools, such as the integrated development environments, include object browsers that offer
this functionality.

• Object collections. Some languages do not propose the concept of set of
objects that collects all the (or some) instances of a class. This concept must be
simulated by the programmer in ways that are not standardized: built-in
or home-made constructor, record arrays, files, chains, pointer arrays, bit-maps,
etc.

• Object relationships. Inter-object associations are not widespread yet, at least
in current implementations. They can be implemented in various ways:
complex attributes, foreign keys, embedded objects, references (oid) to foreign
objects, pointer arrays, chains, etc. Redundant implementations through which
bidirectional access is available can be controlled through the inverse
constraint.

• Identifiers. An identifier, or key, is an attribute (or set thereof) that designates
a property that is unique for each instance of a class. Except in recent
proposals (e.g. ODMG (Cattell, 1994)), this uniqueness cannot be declared, but
must be procedurally enforced by the updating methods related to the class.

• Cardinality constraints. In most OO models, an attribute can be mandatory
single-valued (only one value per instance) or multivalued (from 0 to N
values*). Defining any more precise constraint is up to the programmer. For
instance, asserting that a BOOK has from 1 to 5 authors can be done only by
declaring attribute Authors as a set-of PERSON, degrading the [1-5] constraint
into the more general [0-N]. Here again, the programmer will develop
checking code in all the relevant program sections, most generally in the body
of the object management methods.
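The last two points can be illustrated with a minimal sketch. It is written in Python rather than in the C++ of the case study, but the situation is analogous: neither the identifier nor the [1-5] cardinality can be declared, so both are enforced procedurally in the object management methods. The class `Book`, its `isbn` attribute and the class-level dictionary that simulates the object collection are hypothetical and chosen only for illustration.

```python
class Book:
    """A BOOK with a procedurally enforced identifier and [1-5] authors."""

    # Simulated object collection, indexed by the identifier attribute.
    _by_isbn = {}

    def __init__(self, isbn, authors):
        # Identifier: uniqueness is checked by hand in the creation method.
        if isbn in Book._by_isbn:
            raise ValueError("duplicate identifier: " + isbn)
        self._check_authors(authors)
        self.isbn = isbn
        self.authors = list(authors)   # declared only as a plain [0-N] list
        Book._by_isbn[isbn] = self

    @staticmethod
    def _check_authors(authors):
        # Cardinality [1-5]: checked by hand in every updating method.
        if not 1 <= len(authors) <= 5:
            raise ValueError("a book must have from 1 to 5 authors")

    def add_author(self, person):
        self._check_authors(self.authors + [person])
        self.authors.append(person)


book = Book("isbn-1", ["Ann"])
book.add_author("Bob")
```

An analyst reading only the class declaration would see an unconstrained list of authors; the [1-5] rule and the key become visible only by analysing the bodies of `__init__` and `add_author`.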

1.3 Motivation 

While the relevance of reverse engineering standard applications, typically
Cobol/Isam, Cobol/Codasyl or C/Oracle, can no longer be questioned at the
present time, applying this process to state-of-the-art programs can seem a bit
academic. The short analysis developed above shows that the process can be
harder than first estimated. In addition, OO applications can be concerned with the
same problems and evolution patterns as typical legacy applications (anyway,
there can be C++ and Smalltalk legacy applications). We can mention four
scenarios in which the reverse engineering of OO applications must be carried out.

• Redocumentation of OO programs. There is no objective reason to believe
that OO applications have been developed in a more scientific way, with
abstract models, CASE tools and complete, consistent and up-to-date
documentation, than legacy applications. Hence the need to rebuild a correct
documentation that will allow the development team to modify, maintain and
make the application evolve (Kung, 1993).

• Translating an application from one OO language to another (e.g.
converting from Smalltalk to C++). Since the logical object models of the
languages are not identical, recovering an abstract specification of the source
application is the most reliable way to build a good quality equivalent target
application.

• Mapping an OO application on a relational database. This is a very popular
reengineering technique to make C++ objects persistent. Since the semantics
of C++ and SQL are very different, recovering the conceptual model of the
C++ classes is required before generating the semantically equivalent SQL
schema.



* N standing for ∞.

• Migrating an OO application to a standard distributed persistent object system
such as CORBA or ODMG (Cattell, 1994). The new standards can express in a
declarative way much more semantics than, say, C++ or Smalltalk:
relationships, identifiers, inverse, etc. Making these constructs explicit is a
mandatory process before building the new schema.

1.4 Objectives and structure of the paper 

This paper is a contribution to the process of rebuilding an abstract documentation
of the object classes which are declared and manipulated in OO application
programs. We base the discussion and our proposals on the DB-MAIN approach
and tool that have already proved useful to reverse engineer non-OO data-oriented
applications. Section 2 is a short reminder of the components of the DB-MAIN
generic reverse engineering methodology. The way object schemas can be
represented in an abstract way is shown in Section 3. Section 4 develops
a case study made up of a short C++ program from which we will try to
understand the semantics of object classes. In Section 5, we summarize the main
problems that can occur when trying to recover the abstract description of object
classes. CASE support is discussed through some representative functions of the
prototype DB-MAIN CASE tool (Section 6).

2. A GENERIC DBRE METHODOLOGY 

We have proposed a general methodology that can be specialized to the various
data models on which most legacy systems are based, such as standard files, or
CODASYL, IMS and relational databases. This methodology fits OO
applications with few extensions. Since it has been presented in former papers
(Hainaut, 1993), (Hainaut, 1993b), (Hainaut, 1994), (Hainaut, 1996b), (Hainaut,
1996c), (Hainaut, 1996d), we will only recall some of its processes and the
problems they try to solve, which will be illustrated in Section 4. Its general
architecture is outlined in Figure 1.

The methodology is based on a transformational approach stating that many
essential data engineering processes can be modelled as semantics-preserving
specification transformations (Section 5.3). Hence the idea that reverse
engineering can be (roughly) modelled as the reverse of forward engineering. The
model we propose comprises two phases, namely Data structure extraction, the
reverse of Database Physical design, and Data structure conceptualization, the
reverse of Database Logical design, that produce the two main products of reverse
engineering.

Figure 1 - Main processes of the generic DBRE methodology.

The Data Structure Extraction Process consists in recovering the logical
schema of the database, i.e. the complete DMS* schema, including all the implicit
and explicit structures and constraints. It mainly consists of three distinct
sub-processes.

*A Data Management System (DMS) is either a File Management System (FMS) or a Database
Management System (DBMS).

• DMS-DDL text ANALYSIS. A first-cut schema is produced through parsing
the DDL texts or through extraction from data dictionaries.

• SCHEMA REFINEMENT. This schema is then refined through specific
analysis techniques (Hainaut, 1996c) that search non-declarative sources of
information for evidence of implicit constructs and constraints (e.g.
PROGRAM ANALYSIS and DATA ANALYSIS). This is a complex process
that was emphasized rather recently, when analysts realized that many
important constructs and constraints are not explicitly translated into DMS
schemas, but rather are managed through procedural sections, or even are left
unmanaged. Hence the many techniques and heuristics proposed in the
literature (Andersson, 1994), (Petit, 1994), (Blaha, 1995), (Hainaut, 1996c) to
try to recover these implicit constructs. In traditional, non-OO, applications, the
analysts will recover structures such as field and record hierarchical structures,
identifiers, foreign keys, concatenated fields, multivalued fields, cardinalities and
functional dependencies.

• SCHEMA INTEGRATION. If several schemas have been recovered, they have
to be integrated. The output of this process is, for instance, a complete
description of COBOL files and record types, with their fields and record keys
(explicit structures), but also with all the foreign keys that have been recovered
through program and data analysis (implicit structures).

The Data Structure Conceptualization Process addresses the conceptual
interpretation of the DMS logical schema. It consists for instance in detecting and
transforming, or discarding, non-conceptual structures, redundancies, technical
optimizations and DMS-dependent constructs. It consists of two sub-processes,
namely BASIC CONCEPTUALIZATION and CONCEPTUAL
NORMALIZATION.

• BASIC CONCEPTUALIZATION. The main objective is to extract all the
relevant semantic concepts underlying the logical schema. Once the schema
has been cleaned (PREPARATION), two different problems have to be solved
through specific techniques and reasonings: SCHEMA UNTRANSLATION,
through which one identifies the traces of DMS translations and replaces
them with their original conceptual constructs, and SCHEMA
DE-OPTIMIZATION, where one eliminates the optimization structures.

• CONCEPTUAL NORMALIZATION. This process restructures the basic conceptual
schema in order to give it the qualities one expects from any final
conceptual schema, e.g. expressiveness, simplicity, minimality, readability,
genericity, extensibility, compliance with corporate standards (Batini, 1992).

3. REPRESENTATION OF OO CONCEPTS

The methodology produces specifications according to two levels of abstraction, 
namely logical and conceptual schemas. A logical schema expresses the object 
structures as they are perceived by the user or the programmer of a specific DMS. 
For instance, we will consider C++, O2 or ODMG logical schemas. A conceptual
schema is DMS-independent, and expresses the semantics of data structures 
according to a conceptual model. At this level, we will find OMT, Coad-Yourdon 
or UML conceptual schemas (Wieringa, 1997). 

The DB-MAIN approach is based on a large-spectrum model that encompasses 
the concepts of most logical and conceptual models used in data engineering. In 
other words, each practical model can be defined as a specialization of the DB- 
MAIN generic model (Hainaut, 1989). In the following two sections, we will 
describe how logical and conceptual OO schemas can be specified in a uniform
way. To simplify the presentation, we will use the graphical notation of the DB-
MAIN generic model* to represent schemas at both levels.



3.1 Logical OO schemas 



There is a large variety of models at this level, ranging from very basic C++ 
constructs to recent ODMG and CORBA proposals and OODBMS models. We 
must represent all of them in as much detail as needed to ultimately recover the 
semantics of the object classes. 

The definition of a class (graphically represented by a rectangle) comprises 
several parts, namely the class name, the attributes and the methods. A fourth part, 
dedicated to integrity constraints, will be introduced later on. An attribute is 
atomic or compound (tuple), single-valued or multivalued (set, bag, list, array). 




Figure 2 - Attributes and methods of object classes. 

A class can be declared a subclass of another one. Each attribute A is given a 
cardinality [I-J] stating that each parent instance (object or compound attribute 
value) has from I to J associated values, where J="N" stands for infinity. The 
domain of an atomic attribute is either a basic domain (character, string, integer, 
real, boolean, BLOB, etc.) or an object class of the schema, in which case the
attribute will be called an object-attribute. For simplicity, the basic domains and
the most frequent attribute cardinality (i.e. [1-1]) are not explicitly represented.
In addition, only the names of the methods are specified. The schema of Figure 2
includes four object classes. Class DEPARTMENT has three attributes, namely
Name (basic domain, atomic, single-valued, mandatory), Location (compound) and
Employees. The latter is multivalued (set) and its domain is the EMPLOYEE object
class, each of its values being an arbitrarily large (and possibly empty) set of
EMPLOYEE instances. The class has only one method, nbr-of-employees. The basic
methods such as the constructors, destructors, modifiers and accessors are
supposed to be defined, but have not been shown for simplicity. Class EMPLOYEE
includes two mandatory attributes (Emp# and Origin), an optional single-valued
attribute (Function) and a compound, multivalued (list) attribute (History) with
cardinality [0-20]. Origin is a single-valued object-attribute. In addition, this
class inherits attributes Name and Address from class PERSON.

* Being intended to express in a uniform way several popular data and information models, these
graphical conventions result from trade-offs among the specific graphical representations of each of
them. For instance, the representations of an entity type and of a conceptual object class are similar.
In the same way, record types and logical object classes are given similar graphical representations.

These concepts offer a general way to model complex object types built through 
recursive application of the standard constructors tuple and collection-of (i.e. set- 
of, bag-of, list-of or array-of). Though some advanced models comprise the 
concept of inter-object relationships (e.g. ODMG and CORBA), we will describe it 
in the conceptual part of the generic model. 
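
As an illustration only, the classes described for Figure 2 could correspond to C++ declarations such as the following. This is a sketch under assumptions: the figure itself is not reproduced here, so the components of Location and History (city, building, year, position), the C++ spellings emp_no and nbr_of_employees, and the helper add_history are hypothetical, and the fourth class of the figure is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

struct Employee;  // forward declaration: DEPARTMENT and EMPLOYEE reference each other

// PERSON: superclass from which EMPLOYEE inherits Name and Address.
struct Person {
    std::string name;
    std::string address;
};

// Compound (tuple) attribute Location; its components are assumed.
struct Location {
    std::string city;
    std::string building;
};

struct Department {
    std::string name;               // atomic, single-valued, mandatory [1-1]
    Location location;              // compound attribute
    std::set<Employee*> employees;  // multivalued (set) object-attribute [0-N]

    // The only application method mentioned in the text.
    std::size_t nbr_of_employees() const { return employees.size(); }
};

// One tuple of the compound, multivalued attribute History (components assumed).
struct HistoryEntry {
    int year;
    std::string position;
};

struct Employee : Person {
    int emp_no = 0;                     // Emp#, mandatory
    Department* origin = nullptr;       // Origin, single-valued object-attribute
    std::string function;               // optional [0-1]: empty string stands for "no value"
    std::vector<HistoryEntry> history;  // list [0-20]: the bound is not expressed by the type

    static constexpr std::size_t kMaxHistory = 20;

    // Hypothetical helper that enforces the upper bound of cardinality [0-20].
    bool add_history(const HistoryEntry& e) {
        if (history.size() >= kMaxHistory) return false;
        history.push_back(e);
        return true;
    }
};
```

Note how much of the schema is lost in such declarations: the optionality of Function and the [0-20] bound on History appear only in comments or in method bodies, which is precisely the kind of information the refinement process has to recover.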

Simple object models comprise few (if any) integrity constraints, while more 
advanced models propose at least identifiers, or keys, and inverse object-attributes. 

• Class identifiers. A set of attributes forms a class identifier (sometimes called
key) if, at any time, no two objects can share the same values for these
attributes. As in relational schemas, a class can have at most one primary
identifier (id), and any number of secondary identifiers (id'). In Figure 3,
Name is the primary id of class DEPARTMENT, while Employees is a
secondary id. An id is single-valued if it comprises single-valued attributes
(e.g. DEPARTMENT.Name). It is multivalued when it is made of one
multivalued attribute. In the latter case, no two instances can share any of the
attribute values. For example, DEPARTMENT.Employees is declared a
multivalued id (id': Employees[*]), which translates the fact that an employee
cannot be employed in more than one department.

• Inverse object-attributes. DEPARTMENT.Employees and EMPLOYEE.Origin
are declared inverse, indicating that the Origin of an employee is the
department of which s/he is one of the Employees, and vice-versa. Note that
the schema indicates that Employees is both an identifier (id') and an inverse
attribute (inv).
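
The three constraints just described can be stated operationally. The sketch below checks them on in-memory instances; it is not part of the DB-MAIN model or tools, and the struct layouts and function names are assumptions made for the example.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

struct Department;

struct Employee {
    int emp_no;
    Department* origin;  // Origin: the department employing this person
};

struct Department {
    std::string name;
    std::set<Employee*> employees;
};

// Primary identifier: no two departments share the same Name.
bool name_is_identifier(const std::vector<Department*>& depts) {
    std::set<std::string> seen;
    for (const Department* d : depts)
        if (!seen.insert(d->name).second) return false;  // duplicate Name
    return true;
}

// Multivalued identifier: no employee appears in the Employees set of two departments.
bool employees_is_multivalued_identifier(const std::vector<Department*>& depts) {
    std::set<const Employee*> seen;
    for (const Department* d : depts)
        for (const Employee* e : d->employees)
            if (!seen.insert(e).second) return false;
    return true;
}

// Inverse: e belongs to d.Employees exactly when e.Origin is d.
bool origin_and_employees_are_inverse(const std::vector<Department*>& depts,
                                      const std::vector<Employee*>& emps) {
    for (const Department* d : depts)
        for (const Employee* e : d->employees)
            if (e->origin != d) return false;
    for (Employee* e : emps)
        if (e->origin == nullptr || e->origin->employees.count(e) == 0) return false;
    return true;
}
```

Adding the same employee to a second department violates both the multivalued identifier and the inverse constraint, which shows how closely the two declarations are related in this schema.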

We will describe some additional constraints that are commonly found in object 
schemas, though they sometimes are not explicitly declared. 

• Attribute identifiers. A compound multivalued attribute A can be given an 
internal identifier, which is made of a subset I of its components. For each 
parent instance of A, the values (tuples) of A have distinct values for I. In 
Figure 3, the History tuples of each EMPLOYEE instance have distinct Year 
values (id(History): Year).

• Subtype constraints. The set of subclasses of a given class can be constrained
to satisfy set properties: (1) the disjoint constraint (symbol D) prevents any two
subclasses from sharing common instances (otherwise, the classes can overlap); (2)
the total constraint (symbol T) requires every superclass instance to fall in at
least one subclass. Subclasses that are both disjoint and total form a partition
(symbol P).

• Other constraints. Any property that is relevant in other models can be defined 
on object schemas as well. Such is the case of the foreign keys, that can be 




Figure 3 - Integrity constraints: class identifiers, attribute identifiers, inverse 
object-attributes, subtype constraints. 

3.2 Conceptual OO schemas 

A conceptual schema is supposed to be DMS-independent, and to offer an abstract 
view of technical data structures. According to the most popular models (e.g. 
Coad-Yourdon, Booch, Merise-OO, OMT, Fusion, UML (Wieringa, 1997)), the
main aspect of conceptual object models, at least as far as structural aspects are 
concerned, is the absence of object-attributes and the introduction of relationship 
constructs (though the latter can be found in some recent logical models). We will 
discuss how to represent the relationship types. For genericity reasons, we still use 
the DB-MAIN notation. Though it can appear different from the notation of each 
OO model, it correctly expresses the main aspects of all of them, and is more than 
adequate for the purposes of reverse engineering.

Figure 4 - A conceptual object schema. 

A relationship type (rel-type) R has an optional name, comprises at least two roles
and can have attributes. A role has an optional name and is taken by one or several
object classes. It is subject to a cardinality constraint [I-J] that states in how
many (from I to J) relationships any class instance can appear in this role*. In
Figure 4, rel-type (CLIENT, EMPLOYEE) has name Responsible, while the other 
two rel-types are unnamed. In the same way, some roles (e.g. is-origin-of) are 
named while others are not. The flexibility of these naming conventions is a 
necessity for a generic model intended to comply with different operational models. 

The identifier of an object class can now comprise attributes, but also roles, as 
illustrated by object class History which is identified by its Year value among all 
the instances associated with one EMPLOYEE instance. Such a construct is 
sometimes called weak object/entity type. An N-ary relationship type can have
identifiers and attributes too. 



* Attention: this interpretation, which is compliant with such models as Merise or Batini et al.
(Batini, 1992), is the converse of that of, say, OMT. However, together with the concept of relationship
identifier, it encompasses the cardinality concepts of all the other models.



4. A SHORT CASE STUDY 

We will discuss the main concepts developed so far through a small case study 
based on the C++ program presented in the Appendix, the goal of which is the
management of information concerning customers, orders and products. This 
application is incomplete and unrealistic (data are not saved on program closure 
for instance), but it is sufficient to illustrate both the kinds of problems that 
actually occur in practice and the reasoning that can solve them. Most of the 
processes mentioned in Section 2 apply, and will be discussed in some detail. 

4.1 The DMS-DDL text ANALYSIS process 



The C++ analyzer recognizes the class definitions: class name, superclasses, types,
attributes and methods (Figure 5). Pointers to class instances (e.g. customer
*next) are abstracted as object-attributes (e.g. next: *customer). Arrays[I]
are expressed as array-multivalued attributes with cardinality [I-I]. All the
attributes are supposed to be mandatory. This process is a mere representation
translation and uses mere parsing techniques. In particular, it does not rely on any
reverse engineering knowledge.

Figure 5 - The first-cut logical schema.

4.2 The SCHEMA REFINEMENT process



In most situations, the resulting schema will be too coarse, mainly due to the lack 
of expressive power of the language. The refinement phase is intended to recover 
the implicit constructs and constraints that are buried in the programs. Among the
numerous implicit structures and constraints that can be sought (Hainaut, 1996c),
we will concentrate on seven of the most important ones. In addition, we will base
this process on program analysis, thus ignoring the other sources of information
such as the data, the user interface, program execution and the documentation
(whatever its state).

One of the frustrating aspects of reverse engineering is that one cannot guarantee
100% recovery of the specifications. This is true of any knowledge extraction
process, where the quality of the final product is a (hopefully increasing) function
of the quality and completeness of the sources of information, and of the effort we
are willing to put into the process.

A. Optional attributes 

In C++, all the class attributes are considered mandatory. By analyzing the way 
each attribute is initialized, processed and used, we can discover whether it accepts 
void, null or default values, which most often stand for absence of value. As an 
example, we will examine the attributes of customer through the behaviour of
its constructor customer::customer [line 108], which is where essential
integrity constraints should be monitored.

Observations. 

• n_cust receives the value of input argument n_c, 

• name receives the value of argument n, 

• the components of address receive the values of those of argument adr, 

• next receives the value of customers,

• first_ord is set to null.

The analysis is not complete since we must understand the history of each of the 
source variables. Further analysis leads to learning that: 

• the constructor is called from one program point only, namely procedure 
new_cus [line 263]; in its body, we observe that all input arguments of the 
constructor are set to non-empty values yielded by the user, except for zip, for 
which no checking is performed; 

• customers is a global variable initially set to NULL.

Conclusions. All the attributes of class customer are mandatory (cardinality
[1-1]) except three of them, namely address.zip, first_ord, and next
(cardinality [0-1]). By similarly analyzing the origin of the attribute values in the
constructors, we can state the cardinality of the other attributes of the schema.
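
The Appendix program is not reproduced in this section, so the following is only a hypothetical reconstruction of a constructor that would match the observations above. The components street and city, the parameter types, and the final update of the chain head are assumptions; the emergency exit discussed in Section C below is left out.

```cpp
#include <cassert>
#include <string>

struct address_t {     // component names are hypothetical, except zip
    std::string street;
    std::string city;
    std::string zip;   // never checked by the caller: may be empty, hence [0-1]
};

struct order;          // only pointers to order are needed here

struct customer {
    int n_cust;
    std::string name;
    address_t address;
    customer* next;    // receives the current chain head, null for the first customer: [0-1]
    order* first_ord;  // always null at creation time: [0-1]

    customer(int n_c, const std::string& n, const address_t& adr);
};

customer* customers = nullptr;  // global variable, initially null

customer::customer(int n_c, const std::string& n, const address_t& adr)
    : n_cust(n_c), name(n), address(adr), next(customers), first_ord(nullptr) {
    customers = this;  // assumption: the new instance becomes the chain head
}
```

Reading such a body, the analyst sees that next and first_ord can hold a null value right after construction, while n_cust and name are copied from arguments whose non-emptiness must be checked at the call site.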

B. Exact cardinality of multivalued attributes 

C++ arrays are given a number of cells, but there is no way to declare how many
cells can, and must, receive actual values. That is the case of attribute
order.detail[10]. The extraction process can only abstract such constructs
as an attribute detail[10-10] array. Obviously, this cardinality must be
refined. Manipulation of the array elements can be found in the procedure
new_ord that introduces a new order, and in the methods it invokes, namely the
order::order constructor and method add_detail.

Observations. 

• in new_ord, detail[*].n_prod and detail[*].quant are set to 0
through calling the constructor,

• then, add_detail is invoked as many times as there are non-zero n_prod 
values entered by the user; entering no details is a possible event, 

• as expected, add_detail does not allow more than 10 product numbers to 
be specified. 

Conclusion. The exact cardinality of order.detail is [0-10] instead of
[10-10]. 

C. Class identifiers 

It is natural that each major object class should be given an explicit identifier, 
allowing users to designate, e.g., a specific customer or a definite product. Name 
patterns and domain knowledge are of particular help in this quest, but we will use 
program pattern analysis, specifically in the constructors, to find possible class 
identifiers. We will concentrate on the customer class.

Observation. The constructor includes an emergency exit [line 109] through
which no customer instance is created. This exit is triggered by a positive
answer to the invocation of the find_customer function. The latter returns the
customer oid (physical address) of the first customer instance with attribute
n_cust=n_c, i.e. a customer that has the same n_cust value as the one we are
trying to introduce.

Conclusion. n_cust is an identifier of object class customer. 

Through the same analysis, we find the primary identifiers of classes order and 
product. 

D. Attribute identifiers 

For any multivalued compound attribute, the question of whether a uniqueness 
constraint is enforced must be asked. That is the case for order.detail, the
management rules of which are concentrated in method order::add_detail.

Observation. The while loop terminates when either (1) all the detail cells
have been examined without success, or (2) the first cell with n_prod = 0 has
been found, or (3) the first cell with n_prod = n_p has been found. Then, a new
tuple is inserted when an empty cell has been found, i.e. in case 2. In summary, a
new tuple is inserted only when the array does not include another tuple with the
same n_prod value.

Conclusion. Attribute n_prod is unique in the set of detail tuples of any
order instance, and must be declared an identifier of attribute detail.
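
A loop with the three termination conditions described above might look as follows. This is a sketch, not the Appendix code: the method signature, the constant MAX_DETAILS and the helper nbr_of_details are assumptions, and the product-existence test discussed under F is deliberately left out.

```cpp
#include <cassert>

const int MAX_DETAILS = 10;

struct detail_t {
    int n_prod;  // product number; 0 marks an unused cell
    int quant;   // ordered quantity
};

struct order {
    detail_t detail[MAX_DETAILS];

    order() {
        // As in the observed constructor, every cell starts empty.
        for (int i = 0; i < MAX_DETAILS; ++i) {
            detail[i].n_prod = 0;
            detail[i].quant = 0;
        }
    }

    // Returns true when a new tuple was inserted (signature assumed).
    // Callers are expected to pass a non-zero product number.
    bool add_detail(int n_p, int q) {
        int i = 0;
        // Stop on (1) end of array, (2) first empty cell, (3) cell already holding n_p.
        while (i < MAX_DETAILS && detail[i].n_prod != 0 && detail[i].n_prod != n_p)
            ++i;
        if (i < MAX_DETAILS && detail[i].n_prod == 0) {  // case 2 only
            detail[i].n_prod = n_p;
            detail[i].quant = q;
            return true;
        }
        return false;  // array full (case 1) or duplicate n_prod (case 3)
    }

    // Number of cells actually in use: between 0 and MAX_DETAILS.
    int nbr_of_details() const {
        int n = 0;
        while (n < MAX_DETAILS && detail[n].n_prod != 0) ++n;
        return n;
    }
};
```

The same body supports two of the conclusions above: the number of filled cells ranges from 0 to 10, and no two filled cells can hold the same n_prod value.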

E. Non-set multivalued attributes 

Due to the lack of expressive power of standard programming languages, including
C++, small sets of values most often are implemented as arrays, such as
order.detail[10]. By examining how the elementary values are processed,
we can learn whether the position or the ordering of these values is significant.

Observation. The detail attribute is managed in function add_detail. We
already know that it represents a set of values, and not a bag. No
meaningful ordering seems to be maintained. In addition, the other program
sections use indexing of the array elements only to get them, and there is no
apparent meaning associated with this index.

Conclusion. order.detail is just a set of tuples where the element position
and ordering are immaterial.

F. Foreign keys 

Domain knowledge suggests that some links should exist and be maintained 
between order and product instances. The examination of the methods
order::add_detail and order::cost_order gives us the key.

Observation. The most obvious observation relates to attribute names and
domains: two attributes, namely order.detail.n_prod and
product.n_prod, happen to share their names and domains. In addition, one
of them is an identifier. In the method add_detail, a detail tuple is inserted
only when the value of input argument n_p identifies a product instance.
Considering that this method is the only way to add details, we can be sure that
there are no detail tuples without a matching product instance. Further
evidence can be found in method cost_order. To compute the cost of the
order, the body of the method finds the product instance referenced by each
detail tuple. We observe that the price of this instance is asked for without
worrying about its existence, which seems to be taken for granted. We conclude
that each detail tuple is guaranteed to have a matching product instance,
unless the program is wrong.

Conclusion. The component n_prod of order.detail is a foreign key to
object class product.
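
The two program patterns that reveal the foreign key can be sketched as follows. The code is a hypothetical reconstruction, not the Appendix program: find_product, the global products, the used counter (a simplification of the zero-terminated array) and all signatures are assumptions.

```cpp
#include <cassert>

struct product {
    int n_prod;
    double price;
    product* next;
};

product* products = nullptr;  // assumed chain head for product instances

// Returns the product identified by n_p, or nullptr when there is none.
product* find_product(int n_p) {
    for (product* p = products; p != nullptr; p = p->next)
        if (p->n_prod == n_p) return p;
    return nullptr;
}

struct detail_t {
    int n_prod;
    int quant;
};

struct order {
    detail_t detail[10];
    int used;  // number of filled cells

    order() : used(0) {}

    // Pattern 1: insertion is refused unless n_p identifies an existing product.
    bool add_detail(int n_p, int q) {
        if (used == 10 || find_product(n_p) == nullptr) return false;
        detail[used].n_prod = n_p;
        detail[used].quant = q;
        ++used;
        return true;
    }

    // Pattern 2: no existence test, the referenced product is taken for granted.
    double cost_order() const {
        double total = 0.0;
        for (int i = 0; i < used; ++i)
            total += find_product(detail[i].n_prod)->price * detail[i].quant;
        return total;
    }
};
```

The unchecked dereference in cost_order is safe only because add_detail is the sole entry point for new tuples, which is exactly the reasoning that leads to declaring the foreign key.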

G. Access structures 

Unlike higher level data managers, C++ offers no explicit constructs to provide
programmers with instance collection and traversal techniques. The programmers
thus have to implement technical structures to maintain instance collections and to
access successive instances of a class. Chaining is one of the most popular
techniques to implement ordered sets of instances. A chain comprises an external
head pointer that yields the first element of the chain, and next pointers that yield,
for each chain element, the next element, if any. Typically, the next value of the
last element is null.

Observation. In a C++ class structure, chains generally are implemented through
instance pointers that are abstracted as attributes defined as
attribute-name[0-1]: *class-name. The example schema comprises five such
attributes*, whose behaviour has to be examined in detail. Let us consider
attribute next of class customer. As expected, this attribute is processed in the
constructor of its class, and it appears that it is used to chain all the customer
instances. The global variable customers acts as chain head. This hypothesis is
confirmed by the analysis of application procedure list_cus, which is based on
a chain traversal loop. The analysis of the attributes order.next and
product.next leads to similar conclusions.

Through the same approach, the pointers customer.first_ord and
order.next_of_cust also appear as implementing chains whose head is in a
customer instance, and chaining order instances.

The random way instances are inserted in the chains suggests that instance 
ordering is immaterial, and that they implement unstructured sets only. 
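
Both kinds of chains follow the same head-plus-next pattern, sketched below. This is an illustration under assumptions, not the Appendix code: the attribute n_ord, the insertion-at-head policy and the four function names are hypothetical.

```cpp
#include <cassert>
#include <vector>

struct order;

struct customer {
    int n_cust;
    customer* next;    // chain of all customer instances
    order* first_ord;  // head of this customer's chain of orders
};

struct order {
    int n_ord;            // hypothetical order number
    customer* cust;       // owner of the order
    order* next_of_cust;  // next order of the same customer
};

customer* customers = nullptr;  // external head of the customer chain

// Insert at the head of the global chain (insertion policy assumed).
void chain_customer(customer* c) {
    c->next = customers;
    customers = c;
}

// Insert an order into its customer's chain and record the owner.
void chain_order(customer* c, order* o) {
    o->cust = c;
    o->next_of_cust = c->first_ord;
    c->first_ord = o;
}

// Traversal loop of the kind found in list_cus: follow next until null.
std::vector<int> list_customer_numbers() {
    std::vector<int> result;
    for (customer* c = customers; c != nullptr; c = c->next)
        result.push_back(c->n_cust);
    return result;
}

// Same pattern, starting from the head stored in a customer instance.
std::vector<int> list_order_numbers(const customer* c) {
    std::vector<int> result;
    for (order* o = c->first_ord; o != nullptr; o = o->next_of_cust)
        result.push_back(o->n_ord);
    return result;
}
```

Because an order has a single next_of_cust pointer, it can belong to at most one customer's chain at a time, a structural property that the conceptual interpretation of these chains has to preserve.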

Conclusions. There are two kinds of chains, those which merely implement the
collection of instances of each class, and those which implement access to a list of
order instances from each customer instance. The first kind of chains can be
discarded since the abstract OO model we use includes the notion of instance set of
classes. The second kind of chains seems more prone to support semantics, and
must be kept. However, for reasons that will soon appear, these chains need to be
further processed.

First, we replace each such chain with an equivalent construct, namely the
multivalued object-attribute orders. We observe that an order instance cannot be
inserted in more than one such chain, a property that translates into the following
constraint: an order instance cannot appear in the orders set of more than one
customer instance. In other words, orders is a multivalued secondary identifier of
class customer.

When a class A includes an object-attribute with domain B, it is not infrequent
that class B includes an inverse object-attribute with domain A. This could be the
case with class customer with attribute orders and class order with attribute
cust. The analysis of procedure new_ord shows that the value of attribute
cust of the current order instance is the address of the customer instance in the
chain of which this order instance is inserted. Consequently, attributes
customer.orders and order.cust must be declared inverse.

* To be quite precise, we should have proved that each of them is a secondary identifier of its class.

The final logical OO schema is presented in Figure 6.




Figure 6 - The refined logical schema. The multivalued foreign key 
detail[*].n_prod is symbolized by a directed arc to the primary id of product. 

4.3 The SCHEMA PREPARATION process 



This phase is intended to prepare the conceptual interpretation by removing 
constructs that are no longer useful. Such is the case for the basic methods 
dedicated to managing and accessing the object instances. We just keep the 
application methods, i.e. those which implement user-oriented functions. In our 
schema, we can discard constructors (including the pseudo-constructor
add_detail) and accessors. Two methods are kept, namely
customer::amount_due and order::cost_order.



4.4 The SCHEMA DE-OPTIMIZATION process

The size of the system being too small to exhibit realistic optimization constructs,
we will concentrate on untranslation reasoning. However, an important problem
must be addressed in this process, namely vicious IS-A relations implementing
part-of relationships (see Section 5.1).

4.5 The SCHEMA UNTRANSLATION process 

We will consider three important untranslation rules that are intended to recover
the original conceptual constructs of OO schemas.

A. Transforming the object-attributes

The schema includes two inverse object-attributes that are transformed globally 
into one-to-many rel-type orders. 




Figure 7 - The basic conceptual schema. 

B. Transforming the complex multivalued attributes 

Attribute detail is compound, multivalued, has an identifier and includes a 
foreign key. This is a typical implementation of a dependent (sometimes called 
weak) object class. This new class inherits the source name detail and the new 
rel-type is called of. 

C. Transforming the foreign keys

Foreign key n_prod of newly defined class detail is replaced with a many-to- 
one rel-type called refer from detail to product. 

The basic conceptual schema is presented in Figure 7.

Figure 8 - The normalized conceptual schema.

4.6 The CONCEPTUAL NORMALIZATION process

The schema obtained so far is modified in such a way that it satisfies corporate 
methodological standards: model, naming conventions, graphical rules, etc. To 
illustrate the process, we propose in Figure 8 an equivalent schema, in which the 
names have been normalized according to local rules (e.g. class names in 
uppercase, attribute names capitalized, no underscores) and in which rel-types with 
attributes are allowed. 

5. PROBLEM SOLVING IN OO REVERSE ENGINEERING

Despite its small size and its artificial nature, the program processed in Section 4 
exhibits some of the most common problems that can occur when recovering the 
specifications of a structured collection of object classes. Though larger 
applications will include other kinds of problematic structures, the experiment 
above is sufficient to discuss the minimal requirements for a general methodology 
to perform OO application reverse engineering with success.

It is worth recalling some of the advantages of OO programs (even based on low
level DMS, such as C++) as compared with more traditional development tools.
Hierarchical object structures can be made explicit through recursively applying
the tuple/set constructors. IS-A hierarchies can be explicitly declared.
Methods provide a centralized support to integrity control and to logical 
relationships between objects (e.g. through inter-object navigation). 

On the other hand, OO applications exhibit all the problems that have been found
in classical legacy systems. The reason is three-fold. Firstly, even if it is state-of-
the-art, each object manager has its own weaknesses that force the programmer to
resort to traditional programming techniques to compensate for them. Secondly,
many object managers lack features that are available in even the most primitive
file managers (e.g. uniqueness constraint). Finally, and more importantly, the best
programming environment, be it OO or not, cannot force programmers to work in
the consistent and disciplined way the OO paradigm seems to naturally imply.

The rest of the section will discuss OO-specific problems as well as more general
ones. 

5.1 Ambiguity of the OO paradigm 

One of the most disturbing observations is the fact that different scientific
communities have given the concept of object class hierarchy distinct
interpretations (Brachman, 1983). Two communities are concerned with the
question we address in this paper, namely those from the programming and the
IS/DB realms. Let us consider three object classes A, B and C, B and C being
subclasses of A. On the one hand, OO programmers will generally consider that
an object is in class A, or in class B or in class C, but in one of them only. For
instance, A can be declared an empty or virtual class if all the instances fall in
either B or C. On the other hand, IS/DB people will consider that each B (or C)
instance is an A instance as well. In short, in the programming realm, IS-A means
property inheritance while in the IS/DB realm, IS-A means a subset relation
between the populations. In more precise terms, if we call prop(O) the function
that returns the properties (attributes, roles and constraints) of object class O and
inst(O) the function that returns the current set of instances of class O, we have the
following time-independent properties:

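• according to the programming community:

$\mathrm{prop}(A) \subseteq \mathrm{prop}(B) \;\wedge\; \mathrm{inst}(B) \cap \mathrm{inst}(A) = \emptyset$

• according to the IS/DB community:

$\mathrm{inst}(B) \subseteq \mathrm{inst}(A)$, hence $\mathrm{prop}(A) \subseteq \mathrm{prop}(B)$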

The semantics of the object classes in OO programs will be strongly influenced
by the interpretation they have been based on. Recovering a conceptual schema
from an OO program can involve an in-depth analysis of the intended meaning of
IS-A relations, as illustrated in Figure 9 (left), which summarizes one of the best
illustrations of the problem.

Figure 9 - Interpreting vicious IS-A hierarchies.



It was provided by the Borland OO-Pascal (V7) documentation published
some years ago. The very first example of class hierarchy presented in the tutorial
is the following. Let us consider object class point, with attributes X and Y. We
define a subclass circle, with which we associate a new attribute radius, and
which inherits X and Y interpreted as the circle center coordinates. Generations of
programmers were introduced to the OO approach with this particularly awkward
example which suggests that circles form a special kind of points! The right side
of Figure 9 proposes a more natural expression of the intended semantics.
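
Transposed to C++ (the tutorial itself used Pascal), the two readings can be contrasted as follows. The type and function names are chosen for this sketch only.

```cpp
#include <cassert>

struct point {
    double x;
    double y;
};

// Tutorial-style hierarchy: inheritance is used only to reuse X and Y.
// Under the subset reading of IS-A, every circle would also be a point.
struct circle_is_a_point : point {
    double radius;
};

// Alternative closer to the intended semantics: a circle HAS a center.
struct circle {
    point center;
    double radius;
};

// A function written for points silently accepts the first kind of circle.
double distance_squared(const point& a, const point& b) {
    const double dx = a.x - b.x;
    const double dy = a.y - b.y;
    return dx * dx + dy * dy;
}
```

With the inheritance version, distance_squared(c, p) compiles for a circle c although the radius is ignored; with the composition version, the caller must write c.center explicitly, which makes the part-of relationship visible in the code.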
5.2 Program analysis techniques

It is now quite obvious that program analysis can be the only way to extract the 
necessary knowledge on object behaviour to make hidden constraints and 
structures explicit. Visually analyzing the body of short object management 
methods can be sufficient in small applications. It can prove unreliable in large
applications, in which objects tend to get more complex, and the methods much 
larger. Indeed, associating several hundreds of methods with each object class is 
not uncommon. In addition, in complex applications, not all the object behaviour 
will be localized in methods. Operations on sets of objects (sometimes called 
object societies) in which no member emerges as the kernel will often be wired in 
programs instead of in methods. Finally, the way the objects are manipulated in 
the program itself will often bring much interesting information on implicit 
constraints, or on the meaning of obscure attributes. 

Hence the need for powerful program analysis techniques. We will briefly 
describe two classes of techniques, namely component dependency analysis and 
program slicing. 

Two program data components (file, record, object, field, type, frame, constant)
depend on each other if, at some point of the life of the program (compile time,
run time), a property (name, type, structure, state, value) of one of them may
depend on those of the other one. For instance, variables A and B can be
considered mutually dependent if the program comprises either a statement
assigning the value of B to A (or conversely), or a chain of statements resulting
in A and B being assigned the same value at some point of program execution, or
A being compared to B. The very meaning of such a dependency can vary
according to the goal of the process: semantic similarity, same structure,
evidence of a foreign key, etc. A special form of dependency graph is the
dataflow graph, in which directed arcs represent assignment operators only. This
abstraction can be smaller and more precise than general dependency graphs
(Anderson, 1996). Such techniques have been illustrated in Section 4 when trying
to discover foreign keys.
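
A minimal sketch of both graphs is given below. It is a flow-insensitive approximation written for this example, not the technique of the cited work: the Statement encoding, the Graph type and the three functions are assumptions, and a chain of assignments is approximated by reachability.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

enum class Kind { Assign, Compare };

// Assign: target := source.  Compare: target is compared with source.
struct Statement {
    Kind kind;
    std::string target;
    std::string source;
};

typedef std::map<std::string, std::set<std::string> > Graph;

// General dependency graph: undirected edges for assignments and comparisons.
Graph dependency_graph(const std::vector<Statement>& program) {
    Graph g;
    for (const Statement& s : program) {
        g[s.target].insert(s.source);
        g[s.source].insert(s.target);
    }
    return g;
}

// Dataflow graph: directed arcs source -> target, for assignments only.
Graph dataflow_graph(const std::vector<Statement>& program) {
    Graph g;
    for (const Statement& s : program)
        if (s.kind == Kind::Assign) g[s.source].insert(s.target);
    return g;
}

// True when b can be reached from a by following the arcs of g.
bool reachable(const Graph& g, const std::string& a, const std::string& b) {
    std::set<std::string> visited;
    std::vector<std::string> todo(1, a);
    while (!todo.empty()) {
        std::string v = todo.back();
        todo.pop_back();
        if (v == b) return true;
        if (!visited.insert(v).second) continue;  // already explored
        Graph::const_iterator it = g.find(v);
        if (it == g.end()) continue;
        for (const std::string& w : it->second) todo.push_back(w);
    }
    return false;
}
```

Applied to the statements n_p := input, detail.n_prod := n_p and a comparison of product.n_prod with n_p, the dependency graph connects detail.n_prod to product.n_prod, which is the kind of evidence used for the foreign key in Section 4, while the dataflow graph keeps only the two assignment arcs.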

In short, program slicing works as follows (Weiser, 1984). Let us consider a 
data object D (variable, constant, record type, file, etc.) of a program P, and any 
point p of P (statement, label, inter-statement point). E is the program slice of P 
with respect to criterion (p, D) if it is the subset of the statements of P such that, 
whatever the external conditions of execution, the state of D at point p is the same, 
whether E or P is executed. In other words, all the statements of P that can 
influence the state of D at p form the subprogram E. Normally, E is much shorter 
than P, and will be much easier to examine than P as far as understanding the 
behaviour of D is concerned. A typical example is the slice of a record type (D) at 
a file writing point (p). It is expected that this program excerpt includes all the 
statements that check and manage the records before writing them in the file. 
Further analyzing this slice will be easier than coping with the whole program. An 







application of program slicing to database reverse engineering can be found in 
(Henrard, 1996). 
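
The idea of a slice with respect to a criterion (p, D) can be sketched on a toy example. The sketch assumes straight-line code only (no branches, loops or procedure calls), with each statement reduced to the variable it defines and the variables it uses; the program, the variable names and the function `backward_slice` are hypothetical.

```python
# Each statement is (defined_variable, used_variables); straight-line code only.
PROGRAM = [
    ("price", ()),                 # 0: read price
    ("qty", ()),                   # 1: read qty
    ("label", ()),                 # 2: read label
    ("total", ("price", "qty")),   # 3: total = price * qty
    ("label", ("label",)),         # 4: label = trim(label)
    ("rec", ("total",)),           # 5: rec = format(total)
]

def backward_slice(program, point, variable):
    """Indices of the statements before `point` that can influence `variable` there."""
    relevant = {variable}
    kept = []
    for index in range(point - 1, -1, -1):
        defined, used = program[index]
        if defined in relevant:
            kept.append(index)
            relevant.discard(defined)   # this definition hides earlier ones
            relevant.update(used)       # its operands become relevant instead
    return sorted(kept)

SLICE_REC = backward_slice(PROGRAM, 6, "rec")
SLICE_LABEL = backward_slice(PROGRAM, 6, "label")
```

Here the slice for `rec` at the end of the program keeps statements 0, 1, 3 and 5 and drops the two statements that only concern `label`. Real slicers must also follow control dependencies and procedure calls, which this sketch deliberately ignores.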

For reasons that are beyond the scope of this paper, several kinds of slices can be computed 
(Tip, 1994), (Horwitz, 1990). Optimistic slices include most of the statements of E 
as defined above, but some contributing statements may be lost. Conservative 
slices are guaranteed to include all the statements of E, but possibly some others as well. 
Of course, optimistic slices are shorter and cheaper to compute, but also less 
precise, than conservative ones. 

5.3 Schema transformation 

This is the basic tool that can be used to reliably derive schemas from source 
schemas. Schema transformations are ubiquitous techniques that have been 
proposed to support numerous processes in database engineering, and particularly 
in reverse engineering (Hainaut, 1993b). They are operators that replace 
constructs in a schema with other constructs. Generally, the second schema better 
meets definite criteria (normalization, minimality, compliance with a data model, 
etc.) that the first one does not meet. The class of semantics-preserving 
transformations is of particular importance. Such a transformation guarantees 
that the source and the final schemas convey the same semantics, i.e. they describe 
exactly the same application domain, and any situation that can be described by 
one of them can be described by the other one. 

In reverse engineering, schema transformation will be mainly used in the 
Conceptualization phase. Indeed, replacing logical constructs with their 
conceptual equivalent, restructuring the schema to discard optimization constructs 
and normalizing conceptual schemas are typical schema transformations. 

Due to the limited scope of this paper, we will only present three formal 
operators that are used in Section 4 (Figures 10 to 12). Being semantics-preserving, 
they can be read in both directions. The reader will find further information on 
these techniques in (Batini, 1992), (Hainaut, 1993b), (Hainaut, 1996) and (Rosenthal, 
1994), for example. In (Blaha, 1996), specific OO transformations are proposed. 








Figure 10 - Transformation of two inverse object-attributes into a rel-type (and 
conversely). 

Figure 11 - Extracting a multivalued attribute as an autonomous object class (and 
conversely). 








Figure 12 - Expressing a foreign key as a relationship type (and conversely). 
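
The reversibility of such an operator can be illustrated with a minimal sketch of the transformation named in the caption of Figure 12. The schema encoding below is hypothetical and much simpler than the model used by an actual CASE tool; it assumes that the foreign key columns are not part of the identifier of the source entity type, and the names `fk_to_rel_type` and `rel_type_to_fk` are illustrative only.

```python
import copy

# Hypothetical, simplified schema encoding.
SCHEMA = {
    "entities": {
        "CUSTOMER": {"attributes": ["n_cust", "name"], "id": ["n_cust"]},
        "ORDER": {"attributes": ["n_ord", "n_cust"], "id": ["n_ord"]},
    },
    "foreign_keys": [
        {"source": "ORDER", "columns": ["n_cust"], "target": "CUSTOMER"},
    ],
    "rel_types": [],
}

def fk_to_rel_type(schema, fk, name):
    """Replace a foreign key with a many-to-one relationship type."""
    result = copy.deepcopy(schema)
    result["foreign_keys"].remove(fk)
    source = result["entities"][fk["source"]]
    source["attributes"] = [a for a in source["attributes"] if a not in fk["columns"]]
    # The column names are kept so that the inverse operator can restore them.
    result["rel_types"].append({"name": name, "many": fk["source"],
                                "one": fk["target"], "columns": list(fk["columns"])})
    return result

def rel_type_to_fk(schema, name):
    """Inverse operator: replace a many-to-one relationship type with a foreign key."""
    result = copy.deepcopy(schema)
    rel = next(r for r in result["rel_types"] if r["name"] == name)
    result["rel_types"].remove(rel)
    result["entities"][rel["many"]]["attributes"].extend(rel["columns"])
    result["foreign_keys"].append({"source": rel["many"],
                                   "columns": list(rel["columns"]),
                                   "target": rel["one"]})
    return result

CONCEPTUAL = fk_to_rel_type(SCHEMA, SCHEMA["foreign_keys"][0], "places")
RESTORED = rel_type_to_fk(CONCEPTUAL, "places")
```

Applying the operator and then its inverse gives back the original schema, which is the property that allows the figures to be read in both directions.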



6. DBRE CASE SUPPORT 



From the analysis developed above, we can draw some minimal requirements that 
should be addressed by CASE tools intended to support reverse engineering of 
legacy systems, including OO applications. Besides natural functions related to 
specification entry, management and browsing, the tool must offer powerful 
program analysis processors and a large collection of transformation operators. 
From several large-scale experiments, we learnt that the process is largely 
interactive and exploratory. Hence the need for extensibility to cope with 
unexpected situations. 

To tackle data reverse engineering projects in a realistic way, we have developed 
a CASE tool suite which is being extended to the object paradigm, not only to 
recover the specifications of OO applications, but also to address the OO 
expression of traditional applications. 

The DB-MAIN tool can be used either as a toolset to support system (forward 
and reverse) engineering, or as a CASE tool development environment (i.e. a meta- 
CASE tool). 

In its current version (Version 3, November 1997), the tool offers sophisticated 
support for forward and reverse engineering activities. More specifically, it 
includes the following functions and components: 

• specifications management: access, browsing, creation, update, copy, analysis, 
memorizing; 

• representation of all the project products, and of their relationships: schemas, 
views, source texts, reports, generated programs; 

• view derivation and management; 
• a generic, wide-spectrum, representation model for conceptual, logical and 
physical objects, according to the most popular value-based, entity-based and 
object-oriented paradigms; 

• semantic and technical annotations attached to each specification object; 

• multiple views of the specifications (4 hypertexts and 2 graphical views); some 
views are particularly intended for very large schemas; 

• a toolbox of about 30 semantics-preserving transformational operators which 
provide a systematic way to carry out such activities as conceptual 
normalization, or the development of optimized logical and physical schemas 
from conceptual schemas, and conversely (i.e. reverse engineering); 

• code generators; report generators; 

• code parsers extracting physical schemas from SQL, COBOL, CODASYL, 
RPG and IMS source programs; O2 and C++ parsers are under development; 

• interactive and programmable text analysers which can be used, among other things, to detect 
complex programming cliches in source texts, to build dataflow and 
dependency diagrams, and to compute program slices; 

• a sophisticated name processor to clean, normalize, convert or translate the 
names of selected objects in a schema; 

• a history manager which records the engineering activities of the analyst, and 
which makes their further replay and analysis possible; 

• import and export of specifications; 

• a series of assistants. An assistant is an expert module in a specific kind of 
tasks, or in a class of problems, intended to help the analyst in frequent, tedious 
or complex tasks. It allows the analyst to develop scripts which automate 
frequent processes. A library of predefined scripts is provided for the most 
frequent activities. Six assistants are available at present: global 
transformations, transformation script development, schema analysis, text 
analysis (including pattern searching, dependency analysis and program 
slicing), schema integration, foreign key analysis; 

• meta functions that allow users to develop new specification objects* and new 
functions, particularly through the Voyager language. 

Database reverse engineering cannot be carried out automatically. In addition, 
no single tool can support every kind of DBRE project, due to the large variety 
of problems that are encountered in practice, as opposed to development projects, 
which can profit from fairly standard approaches. Therefore, the DB-MAIN tool 
does not claim to solve DBRE problems automatically. On the contrary, it 
provides a rich collection of configurable toolboxes, together with rapid 
development tools to build ad hoc processors dedicated to specific projects. 

The functions of the DB-MAIN CASE environment have been described in 
previous papers, such as (Hainaut, 1996d). Further information can be found at 
http://www.info.fundp.ac.be/~dbm. A free education version of the tool is 
available. However, some of the OO extensions are still under development at 
present (and therefore may be unstable), and will be made available on 
request only. 

* In the public version meta-objects and meta-relations cannot be created. Only meta-properties can 
be associated with built-in meta-objects. 

7. CONCLUSION 

It is now clear that OO applications, beyond some positive aspects as far as 
program understanding is concerned, share many of the difficult problems that 
have been experienced in reverse engineering traditional, 3GL, legacy systems. 
The positive aspects derive from a (somewhat) more powerful data model and a 
stronger localisation of the code which manages the data objects. However, the 
weaknesses of some languages concerning data integrity (e.g. C++) and 
development practices inherited from old environments induce a complexity that is 
comparable to that of applications based on traditional languages. Therefore, it 
appears that the techniques developed can be adopted, with some adaptation, to 
tackle OO applications. 

The experience also shows that data reverse engineering is far from an automated 
process, except for some specific processes, such as preliminary analysis of 
declarative code. Full human control is essential, even if it can be supported by 
powerful tools. 

The last conclusion we would like to propose concerns the training of reverse 
engineering analysts (Hainaut, 1997). While forward engineering is fairly well 
understood and mastered, reverse engineering appears as an engineering discipline 
that makes use of complex theories and techniques known to very few 
professionals. For instance, the concept of program slicing, which is essential 
in program understanding, requires much effort both from the trainers and the 
trainees before being efficiently and reliably mastered. 

APPENDIX - THE C++ SOURCE CODE 

  1  #include <date.h>
  2  #include <alloc.h>
  3  #include <string.h>
  4  #include <stdio.h>
  5  #include <io.h>
  6  class order;
  7  class product;
  8  class customer {
  9    public:
 10      typedef struct addr_ {
 11        char street[40]; char num[5];
 12        char zip[5]; char city[20];
 13      } addrT;
 14    private:
 15      int n_cust; char name[30];
 16      addrT address; order *first_ord;
 17      customer *next;
 18    public:
 19      customer(int n_c, char *name, addrT adr);
 20      customer *get_next();
 21      int get_n_cust();
 22      char *get_name();
 23      addrT *get_address();
 24      order *get_first_order();
 25      void set_first_order(order *ord);
 43      void add_detail(int n_p, int q);
 44      order *get_next();
 45      order *get_next_order();
 46      int cost_order();
 47  };
 48  class product {
 49    private:
 50      int n_prod; char name[30];
 51      int price; product *next;
 52    public:
 53      product(int n_p, char *n, int p);
 54      int get_n_prod();
 55      char *get_name();
 56      int get_price();
 57      product *get_next();
 58  };
 59
 60  customer *find_customer(int n_c);
 61  order *find_order(int n_o);
 62  product *find_prod(int n_p);
 63  customer *customers = NULL;
 64  order *orders = NULL;
 65  product *products = NULL;
 66
 67  customer *find_customer(int n_c)
 68  { customer *cur;
 69    cur = customers;
 70    while (cur) {
 71      if (cur->get_n_cust() == n_c)
 72        return (cur);
 73      cur = cur->get_next();
 74    }
 75    return (NULL);
 76  }
 77
 78  order *find_order(int n_o)
 79  { order *cur;
 80    cur = orders;
 81    while (cur) {
 82      if (cur->get_n_ord() == n_o) return (cur);
 83      cur = cur->get_next();
 84    }
 85    return (NULL);
 86  }
 87
 88  product *find_prod(int n_p)
 89  { product *cur;
 90    cur = products;
 91    while (cur) {
 92      if (cur->get_n_prod() == n_p) return (cur);
 93      cur = cur->get_next();
 94    }
 95    return (NULL);
 96  }
 97
 98  void read_not_null(char *str)
 99  { str[0] = '\0';
100    while (str[0] == '\0') scanf("%s", str);
101  }
102
103  void read_not_null(int *val)
104  { *val = 0;
105    while (*val == 0) scanf("%d", val);
106  }
107
108  customer::customer(int n_c, char *n, addrT adr)
109  { if (find_customer(n_c)) { delete (this); return; }
110    n_cust = n_c; strcpy(name, n);
111    strcpy(address.street, adr.street);
112    strcpy(address.num, adr.num);
113    strcpy(address.zip, adr.zip);
114    strcpy(address.city, adr.city);
115    next = customers; customers = this;
116    first_ord = NULL;
117  }
118
119  customer *customer::get_next()
120  { return (next); }
121
122  int customer::get_n_cust()
123  { return (n_cust); }
124
125  char *customer::get_name()
126  { return (name); }
127
128  customer::addrT *customer::get_address()
129  { return (&address); }
130
131  order *customer::get_first_order()
132  { return (first_ord); }
133
134  void customer::set_first_order(order *f)
135  { first_ord = f; }
136
137  int customer::amount_due()
138  { int total; order *cur;
139    total = 0;
140    cur = get_first_order();
141    while (cur) {
142      total = total + cur->cost_order();
143      cur = cur->get_next_order();
144    }
145    return (total);
146  }
147
148  order::order(int n_o, customer *cus)
149  { int i;
150    order *cur, *prev;
151    if (find_order(n_o)) {
152      delete (this);
153      return;
154    }
155    n_ord = n_o;
156    for (i=0; i<10; i++) {
157      detail[i].n_prod = 0;
158      detail[i].quant = 0;
159    }
160    cust = cus; next = orders;
161    orders = this;
162    cur = cust->get_first_order();
163    prev = NULL;
164    while (cur) {
165      prev = cur;
166      cur = cur->get_next_order();
167    }
168    next_of_cust = NULL;
169    if (prev) prev->next_of_cust = this;
170    else cust->set_first_order(this);
171  }
172
173  customer *order::get_customer()
174  { return (cust); }
175
176  int order::get_n_ord()
177  { return (n_ord); }
178
179  TDate *order::get_date()
180  { return (&ord_date); }
181
182  order::detT *order::get_detail(int i)
183  { return (&detail[i]); }
184
185  void order::add_detail(int n_p, int q)
186  { int i;
187    if (!find_prod(n_p))
188      printf("the product does not exist\n");
189    i = 0;
190    while (i<10) {
191      if (detail[i].n_prod == 0) break;
192      if (detail[i].n_prod == n_p) break;
193      i++;
194    }
195    if (i<10)
196      if (detail[i].n_prod != n_p) {
197        detail[i].n_prod = n_p;
198        detail[i].quant = q;
199      }
200  }
201
202  order *order::get_next()
203  { return (next); }
204
205
206  order *order::get_next_order()
207  { return (next_of_cust); }
208
209  int order::cost_order()
210  { int total, i; product *prod;
211    total = 0;
212    for (i=0; i<10; i++) {
213      if (detail[i].n_prod == 0) break;
214      prod = find_prod(detail[i].n_prod);
215      total = total + detail[i].quant
216              * prod->get_price();
217    }
218    return (total);
219  }
220
221  product::product(int n_p, char *n, int p)
222  { if (find_prod(n_p)) {
223      delete (this);
224      return;
225    }
226    n_prod = n_p; strcpy(name, n);
227    price = p;
228    next = products; products = this;
229  }
230
231  product *product::get_next()
232  { return (next); }
233
234  int product::get_n_prod()
235  { return (n_prod); }
236
237  char *product::get_name()
238  { return (name); }
239
240  int product::get_price()
241  { return (price); }
242
243
244  void new_cus()
245  { int n_cust;
246    customer::addrT addr;
247    char name[30];
248    printf("new customer : \nCustomer code");
249    read_not_null(&n_cust);
250    printf("Customer's name : ");
251    read_not_null(name);
252    printf("address of customer\nstreet : ");
253    read_not_null(addr.street);
254    printf("number : ");
255    read_not_null(addr.num);
256    printf("zip code : ");
257    scanf("%s", addr.zip);
258    printf("city : ");
259    read_not_null(addr.city);
260    if (!new customer(n_cust, name, addr))
261      printf("err: cust not created\n");
262  }
263
264  void list_cus()
265  { customer *cur;
266    cur = customers;
267    while (cur) {
268      printf("%s amount due : %d\n",
269             cur->get_name(), cur->amount_due());
270      cur = cur->get_next();
271    }
272  }
273
274  void new_stk()
275  { int n_prod, price; char name[30];
276    printf("new product\n product number : ");
277    read_not_null(&n_prod);
278    printf("name : ");
279    read_not_null(name);
280    printf("price : ");
281    read_not_null(&price);
282    if (!new product(n_prod, name, price))
283      printf("error : product exists\n");
284  }
285
286  void list_stk()
287  { product *cur;
288    cur = products;
289    while (cur) {
290      printf("%s->%d\n", cur->get_name(), cur->get_price());
291      cur = cur->get_next();
292    }
293  }
294
295  void new_ord()
296  { int n_ord, n_cus, n_prod, quant;
297    customer *cust; order *ord;
298
299    printf("New order\norder num: ");
300    read_not_null(&n_ord);
301    cust = NULL;
302    while (!cust) {
303      printf("customer number : ");
304      read_not_null(&n_cus);
305      cust = find_customer(n_cus);
306    }
307    ord = new order(n_ord, cust);
308    if (!ord) {
309      printf("err: order exists\n");
310      return;
311    }
312    printf("product num (0=end) : ");
313    scanf("%d", &n_prod);
314    while (n_prod != 0) {
315      printf("quantity : ");
316      read_not_null(&quant);
317      ord->add_detail(n_prod, quant);
318      printf("product n° (0=end) : ");
319      read_not_null(&n_prod);
320    }
321  }
322
323  void list_ord()
324  { order *cur; order::detT *det;
325    int i;
326    cur = orders;
327    while (cur) {
328      printf("order : %d\n", cur->get_n_ord());
329      for (i=0; i<10; i++) {
330        det = cur->get_detail(i);
331        printf("\t%d %d\n",
332               det->n_prod, det->quant);
333      }
334      printf("\tcost order : %d\n",
335             cur->cost_order());
336      cur = cur->get_next();
337    }
338  }
339
340  int main()
341  { char choice;
342    choice = ' ';
343    while (choice != '0') {
344      printf("1 New customer\n2 New stock\n3 New order\n4 List of customers\n5 List of stocks\n6 List of orders\n0 end\n");
345      scanf("%c", &choice);
346      switch (choice) {
347        case '1' : new_cus();
348                   break;
349        case '2' : new_stk();
350                   break;
351        case '3' : new_ord();
352                   break;
353        case '4' : list_cus();
354                   break;
355        case '5' : list_stk();
356                   break;
357        case '6' : list_ord();
358                   break;
359      }
360    }
361  }

Abstract 

In spite of advances in database technology, huge amounts of data around the 
world are still managed in files and accessed by COBOL application programs. 
It is a major challenge to migrate this data and these applications to modern 
database technology. 

This work provides a basis for overcoming the mismatch between the data 
model of COBOL and modern object-oriented or semantic data models. We 
propose an algorithm to determine references and inclusion dependencies in 
a COBOL data structure. This information makes it possible to generate 
a schema in any data model, be it a semantic data model like the entity- 
relationship model, an object-oriented model like ODMG-93 or the relational 
model. 



1 INTRODUCTION 



Statement of problem 

In spite of advances in database technology, huge amounts of data around 
the world are still managed in files and accessed by COBOL application pro- 
grams. It is a major challenge to migrate this data and these applications 
to modern database technology. This task is particularly important given the 
current rapid progress in telecommunications technology. The demands on the 
information retrieval systems in companies and organizations will increase as 
a consequence of the necessity of making information available to larger com- 
munities via the internet and intranets. 

Since many COBOL systems can be characterized as legacy systems (Brodie 
and Stonebraker 1995), i.e. badly documented, brittle and monolithic, they 
are difficult to modify and thus expensive to maintain. Reverse engineering 
systems that help programmers understand the underlying data models of 
legacy systems are thus of practical and economic as well as theoretical 
importance. 

This paper addresses the problem of transforming implicitly represented 
information into explicit form. We concentrate on two of the most problematic 
features of COBOL data structures, namely (1) Implicitly modeled references 
between record types, and (2) Internally unspecified data fields. 



Main contributions 

The techniques presented in this paper are novel in that an analysis of the 
source code is used to identify referential integrity constraints within the data 
structures. This technique is considerably more reliable than previous work 
in this domain (see section 7), which is limited to investigations of the data 
definitions rather than the source code as a whole. 

Some important aspects of our approach are: 

• We search the application source code for information on references between 
aggregates that in value-based systems, such as COBOL programs, are 
represented implicitly. 

• The structure of unspecified record fields is determined by a comparison 
with the structure of other variables used to access the unspecified field. 

• We propose a strategy for determining value correspondences between 
record fields that can be used to determine inclusion dependencies. 

This work provides a basis for overcoming the mismatch between the data 
model of COBOL and modern object-oriented or semantic data models. The 
information resulting from the five analysis steps can be used in a straightfor- 
ward way to generate a schema in any data model, be it a semantic data model 
like the entity-relationship model, an object-oriented model like ODMG-93 or 
the relational model. 

An important application of this work is migration of old applications to 
modern technology. Many organizations today are dependent on information 
systems implemented in old-fashioned technology that cannot respond to the 
requirements of today. Typically, a large organization stores its mission-critical 
data in a very large IMS database and manipulates the data through trans- 
actions consisting of COBOL application programs. Such a system does not 
allow a modular or component-based approach and is therefore difficult and 
expensive to maintain. The successful reengineering of such systems must in 
all cases start with eliciting the business objects and business processes that 
are hidden in the application programs. 

Another important application area can be found in the field of database 
interoperation, since a collaboration between databases is only possible if they 
are understood. This is made possible by reverse engineering methods. It is 
generally agreed that a common representation should be used to hide the 
heterogeneity of the different systems. Object-oriented data models are well 
suited for this purpose through their concept of encapsulation. 



Overview of the method 

The algorithm in this paper consists of five steps as follows. An overview is 
given in figure 1. Each step will be detailed in the remainder of the paper. An 
example is given in the appendix. 



Figure 1 Overview of the algorithm (diagram not reproduced; legible box labels: 1. Initialization, 2. Structure resolution, 3. Spreading set construction) 



1. Initialization: In the initialization phase, a symbol table describing the 
items in the data division of the COBOL program is created as a first-cut 
version of the data structures. This is done using existing compiler and 
parsing technology and will not be described in detail (Aho, Sethi and 
Ullman 1986). 

2. Structure resolution: In the structure resolution phase, the structure of 
record fields that are unspecified in the data division is derived from the 
structure of other variables with a known structure and that are used to 
access the field. The detailed structure of the record types is represented 
by a derived type as described in section 2. 

3. Construction of spreading sets: In the next phase, we use data-flow 
analysis to build spreading sets consisting of all variables through which a 
particular value has flown or will eventually flow. The spreading sets give 
information on how record fields are related. This phase is described in 
section 3. 

4. Graph construction: In the graph construction phase, the information 
acquired in the previous steps is represented in a graph as described in 
section 4. The graph represents persistent data items and the relationships 
between them. This is a way of explicitly expressing the relationships 
between the record types of a COBOL application. 

5. Value correspondence analysis: In the value correspondence analysis, 
inclusion dependencies are derived from source code and data instances. 
This will be detailed in section 5. 
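
As a reminder of what the last step is looking for, an inclusion dependency states that every value of one field also occurs as a value of another field. The sketch below only shows how a candidate dependency could be tested against data instances; it is not the analysis of section 5, and the record layouts, the field names and the function names are hypothetical. A successful test supports a candidate but cannot prove it, since later data may violate it.

```python
# Hypothetical record instances, as they might be read from two files.
CUSTOMERS = [{"CUST-NO": 10}, {"CUST-NO": 20}, {"CUST-NO": 30}]
ORDERS = [
    {"ORD-NO": 1, "ORD-CUST": 10},
    {"ORD-NO": 2, "ORD-CUST": 30},
    {"ORD-NO": 3, "ORD-CUST": 40},
]

def inclusion_violations(rows, field, referenced_rows, referenced_field):
    """Values of `field` in `rows` that never occur as `referenced_field`."""
    referenced = {row[referenced_field] for row in referenced_rows}
    return sorted({row[field] for row in rows} - referenced)

def inclusion_holds(rows, field, referenced_rows, referenced_field):
    """True if every value of `field` is included in the referenced values."""
    return not inclusion_violations(rows, field, referenced_rows, referenced_field)

VIOLATIONS = inclusion_violations(ORDERS, "ORD-CUST", CUSTOMERS, "CUST-NO")
```

With these instances the candidate dependency from ORD-CUST to CUST-NO is refuted by the single value 40.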



The information resulting from the five analysis steps can be used in a 
straightforward way to generate a schema in any data model, be it a seman- 
tic data model like the entity-relationship model, an object-oriented model 
like ODMG-93 or the relational model. This paper describes how application 
semantics are determined from analyzing an existing system. Due to space 
limitations, the process of schema creation is not covered. 

A general assumption of this work is that we focus on data-centered infor- 
mation systems. A typical example is administrative systems including large 
amounts of data and application programs, both batch and interactive. The 
user interface typically consists of alpha-numerical forms managed by a pro- 
gram that also verifies the correctness of the entered data. The output of the 
system is through forms or reports. The data access is thus predefined to a 
large extent, i.e. queries to the database are compiled and interactively writ- 
ten ad-hoc queries are rare. This means that it is possible to analyse statically 
the great majority of the data manipulation statements. 



A quick look at COBOL data definitions 

The data division of a COBOL program defines a set of data items that can be 
elementary or structured. An elementary data item has a scalar type (numeric, 
alphanumeric or alphabetic) and a size. A structured data item has a set of 
components. The components of the $i$-th structured data item form a compound 
$(c_{i1}, \ldots, c_{ij}, \ldots, c_{in})$ of elementary or structured data items. 

A structure is a hierarchy of components, where the root node is a structured 
data item (or record type) and the leaves are elementary data items. Data 
items defined in the file section of a COBOL program are associated with 
a file where their values can be stored. Values of data items defined in the 
working storage section are volatile. Data items in a record structure can be 
arrays. 

It is possible to rename a set of consecutive persistent data items with 
the RENAMES statement. This can be done over aggregate boundaries giv- 
ing overlapping structures. The statement REDEFINES is analogous to the 
RENAMES statement but applies to non-persistent data. 

To update sequential files, transaction files are used. These are temporary 
files where operations are written that are later run in batch jobs. 

2 STRUCTURE RESOLUTION (SR) 

A serious problem is that in COBOL, variables are often declared as flat 
alphanumeric fields with an unspecified internal structure. This is possible since 
the type checking is limited to the predefined scalar data types. A structured 
record will be represented as a contiguous block of storage at runtime. If 
the record does not involve items with a special internal format to optimize 
storage space and numeric computations, then this contiguous block may be 
considered simply as a long character string with the elementary names serv- 
ing as names for various substrings within the area. Thus, the true structure 
of the persistent record type cannot be determined by examination of the 
record declaration, but must instead be derived from the variables used to 
access the record type. 

We define a matching operation as an operation suggesting that the values of 
its operands are the same. In COBOL, the matching operations are those that 
define variables, i.e. give them a value, and comparisons for equality and non- 
equality. A match also occurs in the parameter passing when sub-programs 
are invoked. 

Figures 2.a and 2.b show the definitions of the three record types PERSON, 
PERSON-INTERNAL and OFFICE-DATA. Figure 2.c shows a set of 
matching operations on the record types. 



Derived types 

It is possible to draw conclusions about the structure of one data item based on 
the structure of other data items occurring as operands in the same match- 
ing operation. To gain a complete view of the data, the different variable 
structures are combined into a derived type that includes all different inter- 
pretations of a specific data item. 

A derived type is a hierarchical structure with derived elementary data 
items as leaves and derived structured items as nodes*. Derived data items 
are computed by first creating a set containing the start and end positions of 
all leaves. The structured items are then added which means that the structure 
is built bottom-up. 

Figure 2.d shows a schematic representation of the record types defined in 
2.a and b. The space between two vertical bars represents the extension of a 
component. Placing the schematic representations of a set of record types on 
top of each other makes it possible to visualize the spatial correspondences be- 
tween the components of the record types. The schematic representation of the 
derived type is shown in figure 2.e. Intuitively, the derived type is constructed 
by pushing the smaller components downwards. The smallest components will 

* A derived type is a structure where items on one level may share items on the level below, 
it is thus not a tree. 

a) FILE-SECTION
   01 PERSON.
      05 IDENTIFIER     PIC X(26).
      05 DEPARTMENT     PIC X(10).
      05 OFFICE         PIC X(160).

b) WORKING-STORAGE-SECTION
   01 PERSON-INTERNAL.
      05 ID             PIC X(39).
      05 DATA           PIC X(157).
   01 OFFICE-DATA.
      05 BUILDING       PIC X(3).
      05 FLOOR          PIC X.
      05 OFFICE-NUMBER  PIC X(2).
      05 TEL            PIC X(4).
      05 ADDRESS        PIC X(150).

c) PROCEDURE-DIVISION
   MOVE PERSON TO PERSON-INTERNAL.
   MOVE OFFICE TO OFFICE-DATA.

d) (not to scale)
   | IDENTIFIER | DEPARTMENT | OFFICE                                           |
   | ID                                  | DATA                                  |
                             | BUILDING | FLOOR | OFFICE-NUMBER | TEL | ADDRESS |

e) Schematic representation of the derived type (layout not recoverable; it 
   involves the items IDENTIFIER, DEPARTMENT, BUILDING, FLOOR, OFFICE-NUMBER, 
   TEL and ADDRESS together with ID, OFFICE, DATA and PERSON).

Figure 2 Structure resolution 



thus sink to the bottom and become leaves while the larger ones rest at the 
top and become nodes as shown in figure 2.f. Note that the generated structure 
is not a tree. It is quite common that multiple structures are defined on 
a set of leaf data items. The desired structure must be determined by a user; 
the structure resolution process only shows how data is structured within the 
legacy code. 
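
The bottom-up construction can be sketched on the record types of Figure 2. The sketch is a simplified reading of the method, not the authors' implementation: it assumes that every item is a plain character field, that item names are unique across records, and that the two MOVE statements of Figure 2.c align PERSON-INTERNAL with the start of PERSON and OFFICE-DATA with the start of OFFICE. The helper names `layout` and `derived_type` are hypothetical, and the root item PERSON is left out for brevity.

```python
# Field lists (name, size in characters) taken from Figures 2.a and 2.b.
PERSON = [("IDENTIFIER", 26), ("DEPARTMENT", 10), ("OFFICE", 160)]
PERSON_INTERNAL = [("ID", 39), ("DATA", 157)]
OFFICE_DATA = [("BUILDING", 3), ("FLOOR", 1), ("OFFICE-NUMBER", 2),
               ("TEL", 4), ("ADDRESS", 150)]

def layout(fields, base=0):
    """Map each field name to its (start, end) interval, end exclusive."""
    intervals, offset = {}, base
    for name, size in fields:
        intervals[name] = (offset, offset + size)
        offset += size
    return intervals

def derived_type(*layouts):
    """Return the ordered leaves and, for each larger item, the leaves it covers."""
    merged = {}
    for item_layout in layouts:
        merged.update(item_layout)
    # Every start or end position of any item is a cut point.
    cuts = sorted({position for interval in merged.values() for position in interval})
    segments = set(zip(cuts, cuts[1:]))
    # An item is a leaf if it spans exactly one segment between adjacent cuts.
    leaves = {name: iv for name, iv in merged.items() if iv in segments}
    ordered = sorted(leaves, key=lambda name: leaves[name])
    nodes = {}
    for name, (start, end) in merged.items():
        if name not in leaves:
            nodes[name] = [leaf for leaf in ordered
                           if start <= leaves[leaf][0] and leaves[leaf][1] <= end]
    return ordered, nodes

person = layout(PERSON)
internal = layout(PERSON_INTERNAL)                       # MOVE PERSON TO PERSON-INTERNAL
office = layout(OFFICE_DATA, base=person["OFFICE"][0])   # MOVE OFFICE TO OFFICE-DATA
LEAVES, NODES = derived_type(person, internal, office)
```

Under these assumptions BUILDING is covered both by ID and by OFFICE, which illustrates why the derived type is not a tree.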

Arrays can be used as operands in matching operations. In SR, an array 
is treated as a structured data item with as many components as it has 
array items and with all components having the same size and the same 
sub-components. The array case can thus be reduced to the case of scalar data 
types. For arrays with a length that depends on the value of an integer 
variable, we set the size as if the array had its maximal number of items and 
apply the same technique as for fixed-size arrays. 

The RENAMES and REDEFINES statements in COBOL are used to de- 
clare multiple names for one single data area. From the point of view of SR, 
this is equivalent to defining a name that is given a structure defined else- 
where instead of in the declaration of the name. Thus, these statements do 
not cause any particular difficulties. 



3 CONSTRUCTION OF SPREADING SETS 

This section describes the construction of spreading sets which represent the 
links between variables in a program. A matching operation involving record 
fields indicates that the fields are related. A matching of two record fields can 
be done indirectly through a series of matching operations involving other 
variables, persistent or transient, and we must thus follow the trace from 
one field to another through a chain of matching operations. The matching 
operations group variables into equivalence classes called spreading sets. The 
identification of spreading sets is the key to establishing the relationships 
between the records. 

However, record fields are matched differently depending on the flow of ex- 
ecution in the program. This means that all possible definitions and uses of 
a record field must be taken into consideration. This can be done using data- 
flow analysis, which is a well-known technique originally used for compiler 
optimization, (see e.g. (Aho et al. 1986)). The presence of a matching oper- 
ation indicates that the programmer considered that the record fields could 
share a value. Every path in the program flow graph containing a matching 
operation thus represents a possible relationship. 

We start by defining a number of useful concepts. A point in a sequence
of statements is before or after any statement. A use of a variable v is any
occurrence of v as an operand. A definition of v is a statement where v is
assigned a value. A redefinition of a variable v, given a definition $d_i$ of v, is a
definition $d_{i+1}$ of v such that there is no other definition $d_j$ of v between $d_i$
and $d_{i+1}$. A definition originates in a source that can be another variable, a
persistent data item read from a file through a read operation or a value from
an external interface, e.g. a user interface.

Definition-use orders 

In order to establish the flow of values in the program, we first need to deter- 
mine, given the assignment of a value to a variable x, which other variables 
Pi are assigned the value of x before x is redefined with another value. For 
this purpose, we define the notion of a definition-use order (du-order), which 
represents the flow of values in the program. Examples of du-orders are shown 
in figure 4. 
To compute the du-orders, we need to decide where the variables are defined
and where they are used. For this purpose a flow-graph, representing the flow
of control in the program, is constructed using adaptations of standard data-flow
analysis techniques (Aho et al. 1986).

To build the flow-graph, the program is divided into basic blocks with the 
property of having exactly one entry point and one exit point. Investigating 
whole blocks of statements rather than single statements makes the analysis 
considerably faster. The basic blocks are the nodes of the flow-graph, while 
the edges represent a possible flow of control from one block to another. An 
example of a flow-graph is shown in figure 3.



B0
   1: read v
   2: read y

   use[B0]  = nil
   def[B0]  = {1:v, 2:y}
   inU[B0]  = nil
   outU[B0] = {5:v, 7:v, 3:v, 4:y}
   inD[B0]  = {5:c, 6:b, 9:a, 11:x, 4:d, 3:z, 1:v, 2:y}
   outD[B0] = {5:c, 6:b, 7:y, 9:a, 11:x, 4:d, 3:z}

B1
   3: move v to z
   4: move y to d

   use[B1]  = {3:v, 4:y}
   def[B1]  = {3:z, 4:d}
   inU[B1]  = {5:v, 7:v, 3:v, 4:y}
   outU[B1] = {5:v, 6:d, 7:v, 8:z, 9:z}
   inD[B1]  = {5:c, 6:b, 7:y, 9:a, 11:x, 4:d, 3:z}
   outD[B1] = {5:c, 6:b, 7:y, 9:a, 10:d, 11:x}

B2
   5: move v to c
   6: move d to b

   def[B2]  = {5:c, 6:b}
   use[B2]  = {5:v, 6:d}
   inU[B2]  = {5:v, 6:d}
   outU[B2] = nil
   inD[B2]  = {5:c, 6:b}
   outD[B2] = nil

B3
   7: move v to y
   8: if z = y ....
   9: move z to a
   10: move 3 to d
   11: move d to x
   12: if d = a

   def[B3]  = {7:y, 9:a, 10:d, 11:x}
   use[B3]  = {7:v, 8:z, 9:z}
   inU[B3]  = {7:v, 8:z, 9:z}
   outU[B3] = nil
   inD[B3]  = {7:y, 9:a, 10:d, 11:x}
   outD[B3] = nil

Figure 3 Use-definition sets

For each definition of a variable v in the program, we are interested in the
variables to which the value is propagated before v is redefined. To find out
which these variables are, we establish six use-definition sets for each basic
block B that make it possible to iteratively compute data-flow equations to
be used in determining the value flows.

The use-definition sets are use[B] and def[B], which contain statements
where variables are used and defined within a block B. outU[B] and inU[B]
contain statements where variables are used, and serve to calculate the
du-orders. The sets outD[B] and inD[B] contain definitions of variables and
are used to calculate false redefinitions as will be described below. The
use-definition sets are defined as follows:

• def[B] is the set of defining statements that occur in B. Referring to figure
3, def[B1] contains the two statements 3:z and 4:d since z and d are defined
in B1.

• use[B] is the set of statements $s_i$ in B where

1. a data item $x_j$ is used and

2. there is no statement in B prior to $s_i$ defining $x_j$.

In figure 3, in block B3, y is used in statement 8 but since it is defined
in statement 7, it is not included in use[B3].

• outD[B] is the set of statements $s_i$ in blocks subsequent to B, where a
variable v is defined and such that there is no redefinition of v between
the end of block B and any of the $s_i$. The set contains the first statement
in each branch where a variable is defined seen from the end of B. E.g.,
outD[B1] includes the statement 10:d which is replaced by the statement
4:d in outD[B0] since d is redefined in B1.

• inD[B] is the set of statements $s_i$ in block B or in blocks subsequent to
B, where a variable v is defined and such that there is no definition of v
between the point just before block B and any of the $s_i$. The set contains
the first statement where a variable is defined seen from the beginning of
B. E.g. inD[B2] includes only the definitions made in B2 since there are
no definitions after B2.

• outU[B] is the set of statements $s_i$ in blocks subsequent to B, where a data
item is used, and such that there is no definition of that data item between
the end of block B and any of the $s_i$. The set contains all statements where
a variable is used before it is redefined seen from the end of B. E.g., the
use 11:d is not in outU[B1] since d is defined in statement 10, which lies
between the end of B1 and 11:d.

• inU[B] is the set of statements $s_i$ in or after B, where a variable v is used
and such that there is no definition of v between the point just before B
and any of the $s_i$. The set contains all statements where a variable is used
before it is redefined seen from the beginning of B.

use[B] and def[B] are constants that are constructed by an examination of
the statements in B. inD[B] and outD[B] are calculated with the following
data-flow equations:

$inD[B] = (outD[B] -_d def[B]) \cup def[B]$

$outD[B] = \bigcup inD[S]$, S is an immediate successor of B

The operator <op1> $-_d$ <op2> is defined on sets of statements and
removes the statements from <op1> that define the same variable as a
statement in <op2>.

The first equation states that a definition of a variable at the beginning
of block B is either a definition that is also present after B, such that the
variable is not defined in B, or else a definition of the variable in B*. inU[B]
and outU[B] are calculated by the following data-flow equations:

$inU[B] = (outU[B] -_u def[B]) \cup use[B]$

$outU[B] = \bigcup inU[S]$, S is a successor of B

The operator <op1> $-_u$ <op2> is defined on sets of statements
and removes the statements from <op1> that use a variable defined in a
statement in <op2>.

The first equation states that given a variable that is used but that has
not been defined at the point before B (i.e. it is used in B or in a block that
is subsequent to B), there are two possibilities. Either it is used after B
without an intervening definition and is not defined within B, or else it is
used within B before it is defined in B. The second equation states that the
uses with no definition at the end of B are the union of the uses at the
beginning of its successor blocks.

The data-flow equations can be calculated by iterative data-flow analysis
that works for arbitrary flow-graphs. Algorithms for calculating the data-flow
equations are well-known, see e.g. (Aho et al. 1986) for a discussion.
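
The equations above can be sketched as a simple fixed-point iteration over the example of figure 3. This is an illustration under assumptions, not the implementation used in our work: statements are encoded as (number, defined variable, used variables), the names `BLOCKS`, `SUCC`, `local_sets`, `minus` and `solve` are hypothetical, and the edges B0 to B1 and B1 to B2 and B3 are inferred from the sets printed in the figure.

```python
# Sketch: iterate the data-flow equations for inD/outD and inU/outU.
# Set entries are (statement number, variable), e.g. (4, "d") for 4:d.

BLOCKS = {
    "B0": [(1, "v", []), (2, "y", [])],            # read v; read y
    "B1": [(3, "z", ["v"]), (4, "d", ["y"])],      # move v to z; move y to d
    "B2": [(5, "c", ["v"]), (6, "b", ["d"])],      # move v to c; move d to b
    "B3": [(7, "y", ["v"]), (8, None, ["z", "y"]), (9, "a", ["z"]),
           (10, "d", []), (11, "x", ["d"]), (12, None, ["d", "a"])],
}
SUCC = {"B0": ["B1"], "B1": ["B2", "B3"], "B2": [], "B3": []}  # assumed edges

def local_sets(stmts):
    """def[B]: defining statements; use[B]: uses with no prior definition in B."""
    defs, uses, defined = set(), set(), set()
    for num, target, sources in stmts:
        uses |= {(num, s) for s in sources if s not in defined}
        if target is not None:
            defs.add((num, target))
            defined.add(target)
    return defs, uses

def minus(entries, defs):
    """The -d and -u operators: drop entries whose variable is defined in defs."""
    killed = {var for _, var in defs}
    return {(n, var) for n, var in entries if var not in killed}

def solve():
    local = {b: local_sets(s) for b, s in BLOCKS.items()}
    in_d = {b: set() for b in BLOCKS}
    out_d = {b: set() for b in BLOCKS}
    in_u = {b: set() for b in BLOCKS}
    out_u = {b: set() for b in BLOCKS}
    changed = True
    while changed:  # repeat until no set changes any more
        changed = False
        for b in BLOCKS:
            defs, uses = local[b]
            new_out_d = set().union(*(in_d[s] for s in SUCC[b]))
            new_out_u = set().union(*(in_u[s] for s in SUCC[b]))
            new = (minus(new_out_d, defs) | defs, new_out_d,
                   minus(new_out_u, defs) | uses, new_out_u)
            if new != (in_d[b], out_d[b], in_u[b], out_u[b]):
                in_d[b], out_d[b], in_u[b], out_u[b] = new
                changed = True
    return local, in_d, out_d, in_u, out_u

local, in_d, out_d, in_u, out_u = solve()
print(sorted(out_u["B1"]))
```

On this input the iteration yields the sets listed in figure 3, for instance outD[B1] containing 10:d while inD[B1] contains 4:d instead.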



Constructing du-orders 

The use-definition sets make it possible to derive the flow of values in the
program. Given a definition d of the variable v in the block B, the
redefinitions of v are determined as follows:

1. if v is redefined in a statement s occurring after d in B, then s is the only
redefinition of v, or else

2. the redefinitions are the statements in outD[B] where v is defined.

The statements related to d by propagation are:

1. the statements $s_i$ in which a variable w is assigned the value of v, such that
$s_i$ occurs in B after d but before any redefinition of v and w is not related
to v by propagation.

2. the statements in outU[B] where a variable w is assigned the value of v and
w is not related to v by propagation.

* U (∪) denotes union of sets.



Note that we must eliminate redefinitions from variables that occur earlier 
in the du-order and thus have the same value as v. We must also check that 
the redefinitions are not done from a variable that is in the same spreading 
set*. This case is treated below. 

A starting point is a definition where the source is not a variable. This can 
be a READ or an ACCEPT statement, or the assignment of a constant as 
default value. Du-orders are calculated for all starting points. 

Figure 4 shows du-orders of the flow-graph in figure 3. There is one du-order
with the starting point in statement 2 and another with the starting point
in statement 1. Given the definition of y in statement 2, use[B0] contains no
statement including y, but outU[B0] contains a use of y in statement 4 where
d is defined. outU[B1] contains no use of y, which means that it is not further
used below B1. Note that y is actually used in statement 8, but only after it
has been redefined in statement 7. The value of y is propagated to d, which is
not redefined within B1. use[B1] contains no use of d but outU[B1] contains a
use of d in statement 6. outU[B2] and outU[B3] are both empty. The du-order
with starting point in statement 1 is constructed analogously.
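
The following sketch reproduces the two du-orders of the example. Note that it swaps in a different technique from the one described above: instead of consulting the use-definition sets, it walks forward from the starting point and tracks which variables currently hold the value. The statement encoding, the assumed edges and the name `du_order` are hypothetical.

```python
# Sketch: du-order by forward propagation (not the use-definition-set method).
# Statements are (number, defined variable or None, used variables).

BLOCKS = {
    "B0": [(1, "v", []), (2, "y", [])],
    "B1": [(3, "z", ["v"]), (4, "d", ["y"])],
    "B2": [(5, "c", ["v"]), (6, "b", ["d"])],
    "B3": [(7, "y", ["v"]), (8, None, ["z", "y"]), (9, "a", ["z"]),
           (10, "d", []), (11, "x", ["d"]), (12, None, ["d", "a"])],
}
SUCC = {"B0": ["B1"], "B1": ["B2", "B3"], "B2": [], "B3": []}  # assumed edges

def du_order(start):
    """Numbers of the statements that receive the value defined in `start`."""
    blk, idx, var = next(
        (b, i, stmt[1])
        for b, stmts in BLOCKS.items()
        for i, stmt in enumerate(stmts)
        if stmt[0] == start
    )
    order, seen = {start}, set()
    work = [(blk, idx + 1, frozenset({var}))]
    while work:
        blk, idx, frozen = work.pop()
        if (blk, idx, frozen) in seen:  # guards against loops in the graph
            continue
        seen.add((blk, idx, frozen))
        holders = set(frozen)           # variables that hold the value here
        for num, target, sources in BLOCKS[blk][idx:]:
            if target is None:          # comparisons define nothing
                continue
            if any(s in holders for s in sources):
                if target not in holders:
                    order.add(num)      # value propagated to a new variable
                holders.add(target)
            else:
                holders.discard(target)  # redefined from an unrelated source
        if holders:
            work.extend((nxt, 0, frozenset(holders)) for nxt in SUCC[blk])
    return sorted(order)

print(du_order(2), du_order(1))
```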



   2: read y
      |
   4: move y to d
      |
   6: move d to b

   1: read v
      [intermediate statements lost in extraction; see figure 5]
   9: move z to a

Figure 4 Du-orders



Spreading sets 

The du-orders represent a vertical flow of values within a program. There
is also a horizontal extension of a value represented by statements where
variables are compared for equality. Such a statement indicates the possibility
that the variables have the same value, which is all we need.

An eq-comparison is either a comparison for equality or non-equality. A
du-order ds1 is related to another du-order ds2 by the sharing relation, denoted
$, if there exists an eq-comparison involving a variable of some statement in
ds1 and a variable of some statement in ds2. $ partitions the du-orders into
equivalence sets.

* We call this a false redefinition.

To find out which du-orders are related, we must determine which variables
in statements in the du-orders are matched against each other. Given a
definition d of the variable v in the block B, the variables that are matched
against v are determined as follows:



1. variables in eq-comparisons in B that include v but that occur after d, 

2. variables in eq-comparisons in outU[B] that include v . 

Referring to the example in figure 5, the variables d and a are matched in 
statement 11. The du-orders thus form an equivalence set 0. 



A spreading-set is defined as every variable occurring in a statement of an
equivalence set under $.



The spreading set consists of every variable in any statement of 0, i.e. {y,
d, b, a, v, z, c}.
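
Grouping du-orders into spreading sets can be sketched with a union-find structure. The sketch simplifies the conditions given above: it merges two du-orders whenever an eq-comparison mentions a variable of each, without checking where in the flow-graph the comparison occurs. The du-order labels and the function name are hypothetical.

```python
# Sketch: merge du-orders linked by eq-comparisons and collect their variables.

def spreading_sets(du_orders, eq_comparisons):
    """du_orders: {label: set of variables}; eq_comparisons: [(var, var), ...]."""
    parent = {label: label for label in du_orders}

    def find(label):
        while parent[label] != label:
            parent[label] = parent[parent[label]]  # path halving
            label = parent[label]
        return label

    for left, right in eq_comparisons:
        with_left = [l for l, vs in du_orders.items() if left in vs]
        with_right = [l for l, vs in du_orders.items() if right in vs]
        for a in with_left:
            for b in with_right:
                parent[find(a)] = find(b)

    groups = {}
    for label, variables in du_orders.items():
        groups.setdefault(find(label), set()).update(variables)
    return list(groups.values())

# The two du-orders of the example and the comparison "if d = a" of figure 3.
du_orders = {"from 2": {"y", "d", "b"}, "from 1": {"v", "z", "c", "y", "a"}}
print(spreading_sets(du_orders, [("d", "a")]))
```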



   2: read y
      4: move y to d
         6: move d to b

   1: read v
      3: move v to z
         9: move z to a
      7: move v to y
      5: move v to c

Figure 5 Du-orders related under $

The spreading sets are used to construct the connection graph as described 
in section 4. 

False redefinitions 

When the spreading sets are constructed, it must be checked whether a re- 
definition is made from a source in the same spreading set. If this is the case 
the du-order must be extended with the uses up to the next redefinition. 

An additional requirement is that the redefinition of a variable v must not 
be done from a source that has the same value as v. This occurs when the 
redefinition is done from a variable in the same du-order or spreading set. 
The former case is treated during construction of the spreading sets and has 
already been discussed. The latter is trickier since we cannot know at the time
of constructing the du-orders which variables will finally end up in the same
spreading set.

Consider figure 3 where in statement 7, y was redefined with the value of
v. After it has been established that the two du-orders in figure 4 are in the
same spreading set, we must check if the definition of y reaching statement 
7 is a definition within the spreading set. If this is the case, then we know 
that in statement 7, y is assigned the value it already has. In that case, the 
redefinition is invalid and we must recalculate the du-order based on the next 
redefinition of y. This can be done as before by identifying the statements 
where y gives its value to another variable after statement 7 but before y is 
redefined. 

As there is no definition of y after statement 8, we must include all uses of
y until the end of block B3 into the du-order of statement 2. The only use of
y is in statement 8 where it is compared to z that is already in the spreading
set. The du-order therefore does not change.

This process must be repeated until no redefinition is done from a source
within the same spreading set. Given a definition d of x in B, the redefinitions
of d can be determined as:

1. If there is a definition d' of x in B after d then d' is the only redefinition,
or else

2. The definitions of x in outD[B].

For each redefinition R, we must check whether in R a variable y is used
that has been defined in a statement s that occurs in the same spreading set
and such that y has not been redefined since it was defined in s. To calculate the
latest definition of a variable, we can use a standard technique from data-flow
analysis called reaching definitions. The calculation of reaching definitions
is analogous to the calculation of definition-use sets described above, with
the difference that reaching definitions are calculated top-down instead of
bottom-up. We do not describe the details of this computation; the complete
procedure can be found in (Aho et al. 1986). Given that we have found a false
redefinition, we recalculate the du-orders and the spreading sets as before
until they include no false redefinitions.

Arrays 

A match involving an array is a different case since the array index cannot 
be determined statically. There are three cases when it is possible to include 
arrays into the calculation of spreading sets. 

1. If it can be shown that all items in the array come from the same source.

2. If it can be concluded that the index value has not changed between a
definition and a use of an array item.

3. If it can be concluded that the index value when the item is used is within
the interval of index values when the array items were defined.

It is necessary to eliminate transitive redefinitions using variables that oc- 
cur higher up in the du-order and thus have the same value as v. However, 
this is not enough since we must also check that the redefinitions are not 
done from a variable that is in the same spreading set. 

4 CONNECTION GRAPH CONSTRUCTION 

The fourth step in the algorithm is the construction of a connection graph 
representing the data structure of the COBOL application. The connection 
graph includes information from the symbol table, the derived types and the 
spreading sets. Vertices in the connection graph are persistent independent 
data items, i.e. record types defined in the file section. A vertex is detailed 
according to the derived type associated with the data item that it represents. 
An edge is defined between any two variables or data items that occur in the 
same spreading set. The connection graph is constructed as follows: 

1. Each persistent record type Ai defines a vertex Vi in the graph. 

2. The structure of Vi is created top-down. Starting from the top, each data
item d of Ai that has a derived type t is replaced by t together with all its
components. If d has no derived type, it is used in the structure of Vi.

3. An edge is defined between any two data items that are members of the 
same spreading set. These data items are called anchors. 

4. When a record type includes one single field, an edge can be defined on 
the whole record type. If this is the case, then the vertex representing the 
record type is merged with the vertex at the other end of the edge. Two 
record types connected by an edge are thus merged into one single vertex 
and a record type that is linked to a field of another record type is merged 
with that field. 

A transaction file record is not merged with the record type it is used to
update, even if it is persistent. The reason is that the derived type of the
transaction file record and the data file record will not be the same due to the
additional meta attributes in the transaction file record that specify the type
of operation to be done on the data file record. The transaction file records
are excluded from the connection graph even though they are persistent since
they do not contribute any new information from the domain of discourse.
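
Steps 1 and 3 of the construction can be sketched as follows. The record and field names are invented for the example, the dictionary encoding is an assumption, and the merging of step 4 and the expansion of derived types in step 2 are left out.

```python
# Sketch: vertices are persistent record types; an edge links two persistent
# data items that occur in the same spreading set (the anchors).
from itertools import combinations

def connection_graph(records, spreading_sets):
    """records: {record type: [field, ...]}; spreading_sets: iterable of sets."""
    persistent = {f for fields in records.values() for f in fields}
    edges = set()
    for sset in spreading_sets:
        anchors = sorted(name for name in sset if name in persistent)
        edges.update(combinations(anchors, 2))  # transient variables drop out
    return set(records), edges

# Hypothetical records; WS-DEPT stands for a working-storage variable.
records = {"EMPLOYEE": ["EMP-ID", "EMP-DEPT"], "DEPARTMENT": ["DEPT-ID"]}
vertices, edges = connection_graph(records, [{"EMP-DEPT", "WS-DEPT", "DEPT-ID"}])
print(vertices, edges)
```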

5 VALUE CORRESPONDENCE ANALYSIS 

In order to understand relationships between record types, it is valuable to
have information on the correspondences between the values of the anchors.
An important example is inclusion dependencies, which correspond to
referential integrity constraints in object-oriented data models.

Referential integrity can be enforced by external key specifications. How- 
ever, in older systems it is mostly maintained in the application code, i.e. a 
referencing data item is always assigned the value of an existing record. 

An inclusion dependency A :< B is established when it can be determined 
that values inserted into A always come from the set B. This can be traced 
in assignment patterns in the source code by means of spreading sets. 



Source code analysis 

It is sometimes possible to determine inclusion dependencies by analyzing 
patterns of assignments. Recall the definition of spreading-sets in section 3 
as the set of variables that share some value. Let X and Y be components of 
persistent data items and let an access of X be a read or a write of X. 

• If X is accessed in every spreading set where Y is written, then Y :< X.

• If X is accessed in every spreading set where Y is written and Y is accessed 
in every spreading set where X is written, then X = Y. 

Note that in the first point it is enough that the constraint holds where Y
is written. Y may occur in a spreading set where X does not occur as long as
Y is not written. Using the spreading sets, we can detect cases when there are
equal values for two anchors. It is not possible to determine whether values
are different, so we cannot detect disjointness. The reason is that we cannot
exclude that two different spreading sets could contain the same value. For the
same reason, it is impossible to detect a proper subset, since we do not know
whether one anchor has values that another has not.
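
The first rule can be sketched as a check over spreading sets annotated with the persistent items they read and write. The annotation format and all names are assumptions made for illustration.

```python
# Sketch: Y is included in X if X is accessed (read or written) in every
# spreading set where Y is written.

def inclusion_dependencies(spreading_sets):
    """spreading_sets: list of {"read": set, "written": set} of persistent items."""
    items = set().union(*(s["read"] | s["written"] for s in spreading_sets))
    found = set()
    for y in items:
        where_written = [s for s in spreading_sets if y in s["written"]]
        if not where_written:
            continue  # nothing can be said about items that are never written
        for x in items - {y}:
            if all(x in s["read"] | s["written"] for s in where_written):
                found.add((y, x))  # (Y, X) stands for "Y included in X"
    return found

# Hypothetical example: EMP-DEPT is only ever written from a DEPT-ID value,
# while DEPT-ID is also written in a set that never touches EMP-DEPT.
sets_ = [
    {"read": {"DEPT-ID"}, "written": {"EMP-DEPT"}},
    {"read": set(), "written": {"DEPT-ID"}},
]
print(inclusion_dependencies(sets_))
```

If both (Y, X) and (X, Y) are reported, the second rule applies and the two anchors have equal value sets.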



Data analysis 

Information on inclusion dependencies can also be obtained by investigating 
the data. We determine the correspondences between the values of anchors. 
The anchors are attributes that are used to relate aggregates and can thus be 
used to infer referential integrity constraints between aggregates. We therefore 
focus the data analysis on the anchors. The sets of values of two anchors can 
be disjoint, have a non-empty intersection, have a subset relationship or be 
equal. 

To calculate these correspondences, the record types are sorted with respect
to the investigated attributes. The algorithm that checks for inclusion
dependency works on two files simultaneously. One block from each file is read
and the values are compared. Given that the files are sorted, it is enough to
traverse each file once. This is done for each pair of anchors. The total cost
is:

$edgeNumb \cdot ((B_{A_i} \log B_{A_i} + B_{A_i}) + (B_{A_j} \log B_{A_j} + B_{A_j}))$



where edgeNumb is the total number of edges in the connection graph. The 
cost for one anchor is the sum of the cost of the sort and of the traversal of 
the file. Our algorithm calculates inclusion dependencies for large connection 
graphs in reasonable time. 
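
The single traversal of two sorted files can be sketched as a merge over two sorted lists. In-memory lists stand in for the block-wise file reads, and the function name and the returned labels are assumptions.

```python
# Sketch: classify the value sets of two anchors in one pass over sorted input.

def compare_sorted(a, b):
    """a, b: sorted lists of anchor values (duplicates allowed)."""
    i = j = 0
    only_a = only_b = common = False
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common = True
            value = a[i]
            while i < len(a) and a[i] == value:  # skip duplicates in a
                i += 1
            while j < len(b) and b[j] == value:  # skip duplicates in b
                j += 1
        elif a[i] < b[j]:
            only_a = True
            i += 1
        else:
            only_b = True
            j += 1
    only_a = only_a or i < len(a)  # leftovers exist only on that side
    only_b = only_b or j < len(b)
    if not common:
        return "disjoint"
    if not only_a and not only_b:
        return "equal"
    if not only_a:
        return "A subset of B"
    if not only_b:
        return "B subset of A"
    return "non-empty intersection"

print(compare_sorted([1, 2, 2, 3], [1, 2, 3, 4]))
```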



6 IDENTIFYING FUNCTIONAL DEPENDENCIES 

This section describes how functional dependencies are derived. The data 
structure represented in the connection graph may include redundancies intro- 
duced for efficiency purposes e.g. in order to reduce query response time. The 
transformation of a data definition for efficiency purposes is called denormal- 
ization. In order to remove these redundancies, we normalize the denormalized 
schema based on functional dependencies. 

A functional dependency (FD) is a constraint in terms of specific values 
of instances and it is thus necessary to know the actual values in order to 
decide if the constraint is satisfied or not. Since the values are not explicitly 
mentioned in the source code except for constants, a functional dependency 
can rarely be traced in the source code. 

Another source of information is the data. By identifying sources and tar- 
gets of functional dependencies, we can determine functional dependencies in 
an analysis of the data. Once the functional dependencies are determined, 
redundancies can be removed through decomposition with a standard algo- 
rithm. 

The problem with this approach is that an exhaustive search is unfeasible for 
real-life-sized schemas. The number of possible combinations of sources and 
targets is too large. Our approach is to restrict the number of possible sources 
of functional dependencies to bring down the set of possible combinations. To 
achieve this, it is tempting to use the information in the connection graph. 
Attributes on which edges are defined are often keys, i.e. sources of functional 
dependencies. However, one of the most important reasons to denormalize 
a database is to avoid costly join operations. A denormalized aggregate is 
a materialized join and a query on such an aggregate can be limited to a 
selection of the desired instances. Therefore, in a denormalized database, join- 
operations are likely to be scarce and the strategy to use eq-comparisons to 
determine significant attributes may well be a shot in the dark. 

Our strategy to find functional dependencies is based on the assumption 
that a data definition models a set of object types in the domain of discourse 
of the application. We say that a set of functional dependencies F is a minimal 
cover if: 
1. For all $(X \to A) \in F$, A is a single attribute.

2. For no $(X \to A) \in F$ is the set $F - \{X \to A\}$ equivalent to F.

3. For no $(X \to A) \in F$ and proper subset Z of X is the set
$(F - \{X \to A\}) \cup \{Z \to A\}$ equivalent to F.

We define an object type as: 

Let Ft be a minimal cover of FDs such that, taken pairwise, there is a bijective 
mapping between the targets of the FDs. An object type is a set consisting of the 
attributes on which these FDs are defined. 

The sources of the FDs are called identifiers of the object type, the targets 
of the functional dependencies are called properties. 



Identification of object type identifiers 



Access of a single instance of an object type is done through its identifier. This
is true regardless of whether the object type identifier is also the identifier of
the aggregate representing the object type. Figures 6.a through 6.d show a set
of operations used to manipulate the record MACHINE-RESP.



MACHINE-RESP
   MACHINE-NAME
   MACHINE-UNIT
   IP-ADDRESS
   RESP-ID
   RESP-UNIT
   RESP-TEL

MACHINE-NAME → MACHINE-UNIT, IP-ADDRESS
RESP-ID → RESP-UNIT, RESP-TEL
MACHINE-NAME, RESP-ID → ∅

a) delete
   from MACHINE-RESP
   where MACHINE-NAME = :MACHINE

b) delete
   from MACHINE-RESP
   where RESP-ID = :PERSON-ID

c) delete
   from MACHINE-RESP
   where RESP-ID = :PERSON-ID
   and MACHINE-NAME = :MACHINE

d) select machineName
   from MACHINE-RESP
   where RESP-ID = :PERSON-ID

Figure 6 Aggregate in 1NF

These operations are expressed in SQL, but the principle is the same in a
COBOL program; the only difference is that in COBOL more code is required
to do the same thing. These are basic operations necessary to maintain the
data of MACHINE-RESP. This type of operation uses object type identifiers
as selection criteria. A typical operation would be to find out which machines a
given person takes care of, as in the query in figure 6.d. The search condition
indicates that the identifier of the object type is Person. A data manipulation
operation is an operation that is used to maintain the data. Examples of data
manipulation operations are:

• Insertions of instances in aggregates representing relationships with repli- 
cated data, include existence verifications on the related object types. For 
example, to insert an instance of MACHINE-RESP, the existence of the 
machine and the manager must first be verified. This can be detected in 
the code and gives information on the object type identifiers. 

• Deletion of a single instance is done using the identifier of the instance.

• Modifications designate the instance to modify through an object type iden- 
tifier. Additionally, modifications are often done on single instances. 

• Selections on object types based on the identifier. 
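
For operations written as in figure 6, the attribute sets used as selection criteria can be collected with a simple pattern match. This is only a sketch for statements of exactly that shape (equality tests against host variables prefixed with a colon); the function name and the regular expression are assumptions, and real COBOL code would need a proper parser.

```python
# Sketch: collect the attributes compared with host variables in a where clause.
import re

def selection_attributes(statement):
    """Return the set of attributes tested for equality against a host variable."""
    parts = statement.split(" where ", 1)
    if len(parts) < 2:
        return frozenset()
    return frozenset(re.findall(r"([A-Za-z][A-Za-z0-9-]*)\s*=\s*:", parts[1]))

statements = [
    "delete from MACHINE-RESP where MACHINE-NAME = :MACHINE",
    "delete from MACHINE-RESP where RESP-ID = :PERSON-ID",
    "delete from MACHINE-RESP where RESP-ID = :PERSON-ID and MACHINE-NAME = :MACHINE",
    "select machineName from MACHINE-RESP where RESP-ID = :PERSON-ID",
]
candidates = {selection_attributes(s) for s in statements}
print(candidates)
```

Each resulting set is a candidate identifier, to be confirmed or rejected against the data.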

We can define the set of possible sources of a functional dependency. Let
$C_i$ be the set of anchors defined on $A_i$, where $A_i$ is a vertex in the connection
graph. Let $C'_i$ be the set of sets of attributes where the attributes in each set
are used as identifiers in data manipulation operations and let $K_i$ be the set
of possible keys derived from various sources, e.g. index definitions. The set
of possible sources of the functional dependency $S_i$ is defined as:

$S_i = C_i \cup C'_i \cup K_i$

Note that $S_i$ consists of sets of sets of attributes.

Computing targets of FDs 

Given that the possible sources of FDs can be determined as defined above,
we proceed to verify whether these sources actually are sources of FDs and if
so which attributes are their targets. We are interested in elementary FDs, and
it is therefore enough to check single attributes of an aggregate for being the
target of a given source of FD. For every $X \in C_i \cup C'_i$ (the anchors and the
identifiers used in data manipulation operations), X is the source of the
functional dependency $X \to Y$ if

1. there is an attribute Y such that $X - Y = X$ and

2. $X \to Y$ satisfies the definition of an FD.

There may be aggregates including only possible sources, e.g. an aggregate
that is used to model a many-to-many relationship and that includes only the
identifiers of the object types in the relationship. In such an aggregate, only
the identifier is a source of FD. For each $X \in K_i$, $X \to Y$ holds for every
attribute Y in $A_i$. If there is no such Y in $A_i$, then $X \to \emptyset$ holds.

Let $A_i$ be an aggregate occupying $B_{A_i}$ blocks. Let $S_i$ be the set of
possible FD sources in $A_i$. Let $\#S_i$ be the number of elements in $S_i$. For each
source $X \in S_i$, and for each attribute $A \in A_i$, we check that whenever
$t_1[X] = t_2[X]$, then also $t_1[A] = t_2[A]$. If this is not the case, A is not
functionally determined by X.

We assume that the number of blocks M that fit into main memory is
smaller than the number of blocks occupied by $A_i$, i.e. $B_{A_i} > M$. Since we
are interested in all instances of $A_i$ that agree on X, it is advantageous to
group the instances based on X. $A_i$ is therefore sorted with X as the sort key.
The cost of the sort is $B_{A_i} \log B_{A_i}$. Given that $A_i$ is sorted according to X,
the verification of the constraint can be done in one single scan of $A_i$, i.e.
in $B_{A_i}$ disk accesses. The total cost of verifying one possible source is thus
$B_{A_i} \log B_{A_i} + B_{A_i}$, and the cost of checking all elements of $S_i$ is:

$\#S_i \cdot (B_{A_i} \log B_{A_i} + B_{A_i})$
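
The sort-and-scan verification can be sketched on in-memory rows. The rows, the attribute values and the function name `holds` are invented for the example; an actual run would sort and scan the file block by block.

```python
# Sketch: check whether X -> A holds by sorting on X and scanning once.

def holds(rows, x, a):
    """rows: list of dicts; x: tuple of attribute names; a: attribute name."""
    def key(row):
        return tuple(row[c] for c in x)

    prev = None  # (key, value of a) of the previous row
    for row in sorted(rows, key=key):
        k = key(row)
        if prev is not None and prev[0] == k and prev[1] != row[a]:
            return False  # two rows agree on X but differ on A
        prev = (k, row[a])
    return True

# Hypothetical instances of MACHINE-RESP.
rows = [
    {"MACHINE-NAME": "alpha", "IP-ADDRESS": "10.0.0.1", "RESP-ID": "p1"},
    {"MACHINE-NAME": "alpha", "IP-ADDRESS": "10.0.0.1", "RESP-ID": "p2"},
    {"MACHINE-NAME": "beta", "IP-ADDRESS": "10.0.0.2", "RESP-ID": "p1"},
]
print(holds(rows, ("MACHINE-NAME",), "IP-ADDRESS"))
print(holds(rows, ("RESP-ID",), "MACHINE-NAME"))
```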

The cost is thus proportional to the number of elements in $S_i$ and grows as
$B_{A_i} \log B_{A_i}$ with the size of $A_i$. To calculate the FDs of an aggregate
with one hundred attributes, we assume that there is a search condition on 10%
of the attributes, i.e. $\#S_i = 10$. Further, we assume that $B_{A_i}$ is 30000,
about 15 Mbyte. If the time to read a block is 20 msec, the calculation will be
finished in under 10 hours. Clearly, this example is based on assumptions, but it
still gives a picture of the magnitude of the time needed to compute functional
dependencies.
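
The estimate can be retraced numerically. The text does not state the base of the logarithm; the sketch below shows that the figure of under 10 hours is obtained with base 10, whereas base 2 would give roughly 26 hours. The function name and parameters are hypothetical.

```python
# Sketch: evaluate #S * (B log B + B) block reads and convert to hours.
import math

def fd_cost_hours(n_sources, blocks, ms_per_block, log=math.log10):
    """Estimated time in hours to check all candidate FD sources."""
    accesses = n_sources * (blocks * log(blocks) + blocks)
    return accesses * ms_per_block / 1000 / 3600

print(round(fd_cost_hours(10, 30000, 20), 2))                 # base-10 logarithm
print(round(fd_cost_hours(10, 30000, 20, log=math.log2), 2))  # base-2 logarithm
```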



7 RELATED WORK 

Related work in this domain mainly falls into three categories as follows: 

• Translation from relational to entity-relationship models: 

A number of approaches exist for translating from the relational to the 
entity-relationship model, e.g. (Dumpala and Arora 1981), (Navathe and 
Awong 1987), (Casanova and de Sa 1983), (Markowitz and Makowsky 
1990), (Johannesson 1994), (Davis and Arora 1988), (Shoval and Schreiber 
1993), (Fonkam and Gray 1992), (Premerlani and Blaha 1993), (Chiang, 
Barron and Storey 1994). 

These approaches have in common that they assume relational schemas in
third normal form and prior knowledge of inclusion dependencies.
For older systems, this is an unrealistic assumption since the schemas
may well contain redundancies for optimization purposes and since exter- 
nal key specifications are not always supported. Our work is a necessary 
prerequisite for this type of translation. 

• Information acquisition: 

A number of recent approaches address the problem of information
acquisition. For example in (Castellanos 1993), a method is presented for
deriving functional and inclusion dependencies from data and for
constructing an object-oriented schema. Other methods, focusing on analysing
application source code, are presented in e.g. (Signori, Loffredo, Gregori
and Cima 1994), (Petit, Boulicaut, Toumani and Kouloumdjian 1996),
(Hainaut, Englebert, Henrard, Hick and Roland 1995). The few existing
proposals for extracting an entity relationship schema from the data
structures of a COBOL application are mostly limited to searching information
in the COBOL data structure (Davis and Arora 1986), (Nilsson 1985).

• General-purpose tools: 

In the software engineering community, work has been done on general-purpose
tools for program analysis. These include techniques such as
program slicing in (Henrard, Hick, Roland, Englebert and Hainaut 1996)
and (Weiser 1984) and variable slicing in (Chen, Tsai, Joiner, Gandamaneni 
and Sun 1994) and (Joiner, Tsai, Chen, Subramanian, Sun and Gandamaneni 
1994). However, these systems do not propose techniques for identifying in- 
tegrity constraints. A typical example is the method proposed in the REDO 
project (van Zuylen 1993) for constructing an extended entity relationship 
schema based on the data definitions of a COBOL application (Sabanis 
and Stevenson 1992) that does not exploit the powerful source code anal- 
ysis tools elaborated within other parts of the project. 



A more detailed survey of these approaches is given in (Andersson 1995). 



8 CONCLUSION 

We have presented a method to obtain information about the data structure 
of a COBOL program. We have identified: 



• The structure of variables for which no detailed declaration exists in the 
source code. 

• References representing various types of relationships between object types 
modelled in the data structure. 

• Functional dependencies between attributes of the COBOL data structure. 



This information can be used to create a schema representing the data 
structure of the COBOL program in any model, be it a semantic data model 
or the object-oriented model. 

The method has only been partly implemented and it has not yet been
possible to test it on a real case. However, a preliminary study has been made
of the applicability of the method to a real system used to administer
student records at a university. Here, data was represented with a
relational-like data model and the applications were written in a proprietary
4GL resembling COBOL. The study found that it is possible to identify the
relationships between aggregates by analysing the program code.



Abstract

This paper addresses the problem of recovering object-oriented schemata from
relational databases. Solutions to this problem are particularly useful for
designing wrappers for federated database systems. Our goal here is to describe
a reverse engineering methodology for the DOK federated database system
(Tari et al 1996), enabling the wrappers to express relational schemata as
object-oriented schemata which are made available to the different DOK
services.

The reverse engineering methodology we propose involves two main steps. 
The first details the classification of relations to reflect the different object- 
oriented constructs, whereas the second step consists of applying a set of rules 
to generate these constructs based on the different information contained in 
local databases, both at the schema and data level. 

The classification of a relational schema consists of partitioning relations 
into three categories: base relations (relations which form the core classes of 
the target object-oriented schema), dependent relations (relations describing 
binary relations between classes), and finally composite relations (relations 
describing ternary relationships between classes). 

The translation of the classified relations into object-oriented constructs
is performed by analysing them according to two levels of correlation: (i) the
degree of correlation between the keys of the relations, and (ii) the degree of
correlation between the tuples of the relations. Analysing these correlations
uncovers implicit classes and relationships contained in relational schemata.



Keywords 

Reverse engineering, Integrity constraints, Relational databases, Normal forms



1 MOTIVATION 

At the Royal Melbourne Institute of Technology (RMIT), we are currently
designing a distributed system which provides federated services that enable
cooperative processing across different databases. This system, called the DOK
(Distributed Object Kernel) (Tari et al 1996), is defined as a CORBA-based
extension to enable federated processing. It includes a set of core services:
a security service (Tari et al 1997, Tari 1997, Tari 1998) (allowing
the enforcement of both local and federated security policies), a transaction
service (enabling the management of federated transactions), a query service
(Savnik et al 1998) (allowing the decomposition and optimisation of
federated queries), a trader service (providing a mechanism to find DOK objects),
a reflective service (Edmond et al 1995) (providing “additional semantics”
for local information with regard to the location and the capabilities of DOK
objects), and a mining service (allowing the “extraction” of hidden semantics
embedded within distributed and heterogeneous databases). These services
are designed independently of any database platform or model and use
information (or objects) defined as virtual representations of physically defined
objects. As we will see later, these DOK objects are called virtual objects (Tari
et al 1996).

To enable the DOK services to perform specific functions in a distributed
environment, locally defined schemata must be transformed into a
representation which can be used and monitored by these services. The
component of the DOK system enabling such a transformation is called the
reengineering service. It takes a schema (defined in the relational model) as input
and produces an object-oriented schema (as a set of virtual objects). The
generated objects are then used by different DOK managers (e.g. transaction
manager, query manager, etc.) to have an “understanding” of the information
embedded in local relational databases, enabling decomposition and
optimisation of federated queries, retrieval of objects, etc.

This paper addresses the design of the DOK reengineering service. We
assume that local databases support the flat relational model. The proposed
methodology first partitions a relational schema into groups of relations to
reflect the different object-oriented constructs. In our approach, we
distinguish between three types of relation: base relations, dependent relations
and composite relations. Base relations are those which are not dependent on
other relations (that is, they do not contain foreign keys). Dependent relations
express binary relations between classes (such as simple and nested
aggregations) while composite relations express ternary relationships. During the
reengineering process, accurately distinguishing between dependent and
composite relations is crucial, as some dependent relations may “look” like
composite relations. A good example is a relation Address containing two foreign
keys, the key of the Doctor relation and the key of the Hospital relation.
Assuming that the key of Hospital is also a foreign key of Doctor, the relation
Address is not a composite relation but a set of two binary relationships: one
simulates an aggregation from the Hospital class to the Doctor class, and the
other is an aggregation from the Doctor class to the Address class.

After relations are classified according to the three types of relations, the 
reengineering process proceeds as follows. In the initial stage, base relations 
are mapped into classes of the target object-oriented schema. These classes 
will be used as core classes to design the whole schema in an iterative way, 
by adding different types of relationship as well as introducing new classes. 
Thus, the identification of relationships, such as aggregation and inheritance 
relationships, is a crucial step in the design of the reengineering service. Our 
approach deals with this problem by analysing not only the information
contained in a schema but also the data stores. Our reengineering methodology
provides for twelve different mappings, which are based on the correlation
between keys of relations (4 cases) and on the correlation between
the tuples of relations (3 cases). The key correlation cases are the
situations where a foreign key: (i) has no common attributes with the primary
key, (ii) is a part of the primary key, (iii) is equal to the primary key, or (iv)
has a non-empty intersection with the primary key. The tuple correlation
covers the cases where: (a) all data of the foreign key are part of the initial
relation (equality), (b) the data related to the foreign key* has a
non-empty intersection with the initial relation (overlap), and (c) the data
related to the foreign key has an empty intersection with the initial relation
(disjunction).

Because of the two dimensional nature of key and data correlations, our re- 
verse engineering methodology provides a deeper insight into understanding 
the embedded semantics within local databases so that object-oriented infor- 
mation can be easily extracted, especially the cases of (simple and nested) 
aggregations and (simple and multiple) inheritance relationships. 

This paper is organised as follows. Section 2 describes the different ap- 
proaches for database reengineering and puts the DOK approach in context. 
Section 3 outlines the principle of our reengineering methodology. Section 4 
shows how object-oriented constructs can be identified from explicit or im- 
plicit information embedded within a relational database. Finally, Section 5 
concludes with our future work. 



* As the reader will notice, in the rest of the paper we will call this type of key an external
key instead of a foreign key. The terminology used in this section only serves to illustrate
our approach; clearer definitions are provided later.



2 RELATED WORK

Most of the research on database reverse engineering (DBRE) takes one of
three perspectives: a model perspective using schema analysis techniques, a
query perspective using a query language (such as SQL), or a data perspective
based on the analysis of data. This section gives a brief overview of these
approaches and puts the DOK reverse engineering approach in context.



2.1 The DBRE Approaches 

(a) Relational-based Schema Analysis 

Most of the current DBRE research falls into this category, which uses the
relational schema as the basic input and aims to extract the semantic
information by an exhaustive analysis of each relation in the schema and its
key and non-key attributes. The approaches in this category, e.g., (Chiang et
al 1994, Fonkam et al 1992, Castellanos 1993, Johannesson 1994), assume
that the input schema is in at least 3NF, which allows the methodologies
to identify candidate classes easily.

Castellanos (Castellanos 1993), Chiang (Chiang et al 1994) and Fonkam
(Fonkam et al 1992) have very similar approaches to (Johannesson 1994),
although they differ in some respects. Johannesson’s approach (Johannesson
1994) proposes three basic transformations that are repeatedly applied to
produce the re-engineered object-oriented schema: candidate key splitting,
inclusion dependency splitting and folding transformations. Candidate key splitting
and inclusion dependency splitting are used to transform relations containing
more than one object type into multiple relations containing just one object
type, thus preserving a one-to-one relationship between relations and object
types. The folding transformation removes relations occurring in a cycle
of generalisation-indicating inclusion dependencies to preserve the one-to-one
correspondence between relations and object types.

Chiang’s approach additionally deals with the establishment of generalisa- 
tion hierarchies, the determination of regular entities and weak entities, and 
the deriving of many-to-many and one-to-many relationships. This is achieved 
through the classification of relations (e.g. strong, weak, regular and specific 
entities), attributes and their inclusion dependencies. Fonkam’s approach also
derives the generalisation hierarchies; however, this is based on the analysis of
view definitions.

(b) Query-based Analysis 

These approaches are based on the use of query language statements to
extract (some of) the semantic information stored in a relational database, e.g.,
(Andersson 1994, Petit et al 1994). Andersson (Andersson 1994) analyses
equi-join statements whereas Petit’s approach (Petit et al 1994) analyses
auto-joins, set operations and where-by clause statements as well as the equi-join
statements.

The approach proposed in (Andersson 1994) extracts a conceptual schema
(or an entity relationship schema) by analysing SQL statements with respect
to where-in clauses for key attributes. The use of the keyword distinct in
a query implies non-unique values in the attribute, thus eliminating this
attribute from being the key. The join conditions in SQL statements are used
to represent the edges of a schema.

The approach proposed in (Petit et al 1994) starts from the database
schema, which provides the relation names and their attributes, and then
extracts the semantic information for the relevant relations from the available
queries. Petit considers three cases in an equi-join query. Assuming that there
is an equi-join with one key (K), the cases for the attribute (A) are $A = K$,
$A \subset K$, or $A \neq K$. In each case, algorithms are proposed to generate
appropriate relationships.

(c) Data-based Analysis 

These approaches are based on the analysis of data instances to understand
(some of) the semantics of a database application. Premerlani’s approach
(Premerlani 1994) is the best known. It is a fairly informal process
requiring a lot of involvement from the user, with weakly ordered steps that
entail much iteration, backtracking and reordering of steps. This approach
has two main steps: identifying classes and identifying the different types
of relationships between classes. In the initial step, candidate keys are
identified by looking for unique indexes, by automated scanning of data and by
using semantic knowledge that suggests patterns in the data. Foreign-key groups
are then identified by first resolving homonyms and synonyms and then reviewing
matching names, data types and domains which may suggest foreign keys.
Generalisation and aggregation relationships are identified by analysing
foreign-key groupings.
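
The automated scanning of data for candidate keys mentioned above could look roughly like the following Python sketch. It is an assumption-laden illustration rather than Premerlani's procedure: rows are assumed to be a list of dictionaries, the function name candidate_keys and the sample data are hypothetical, and only uniqueness of small attribute combinations is tested.

```python
from itertools import combinations


def candidate_keys(rows, max_size=2):
    """Return minimal attribute combinations whose values are unique in rows."""
    if not rows:
        return []
    attributes = sorted(rows[0])
    found = []
    for size in range(1, max_size + 1):
        for combo in combinations(attributes, size):
            # Skip supersets of a key already found: they are not minimal.
            if any(set(key) <= set(combo) for key in found):
                continue
            values = [tuple(row[a] for a in combo) for row in rows]
            if len(set(values)) == len(values):
                found.append(combo)
    return found


# Invented sample data, for illustration only.
staff = [
    {"doctor_id": 1, "hospital_id": 10, "salary": 500},
    {"doctor_id": 1, "hospital_id": 20, "salary": 500},
    {"doctor_id": 2, "hospital_id": 10, "salary": 700},
]
print(candidate_keys(staff))
# [('doctor_id', 'hospital_id'), ('hospital_id', 'salary')]
```

The second result is unique only by coincidence in this small sample, which shows why such scanning can only suggest keys and why the user's semantic knowledge remains necessary.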

(d) “Pot Pourri” 

Although most of the approaches fall into one of the three categories
presented above, some other solutions have been proposed in the
literature; these mainly deal with the design of reengineering architectures
for database systems. The work proposed in (Hainaut et al 1993) is
particularly useful for understanding the requirements for reverse engineering
support (Hainaut et al. 1995). It does not propose any specific algorithm but
details a generic process model and the main schema transformations useful for
the reengineering processes. These transformations include project-join,
extension, and identifier substitution.

The work presented in (Signore et al. 1994) overlaps all three categories.
It is based on the identification of schema, primary key, SQL, and
procedural indicators that lead to the assertion of Prolog facts and, through
heuristic rules, to the development of a conceptual schema. The four
indicators extract information from different sources. Schema indicators provide
information about the structure of the relations and are extracted from the
DBMS catalogue and from knowledge gained whilst identifying the keys. Key
indicators define the property of primary keys. SQL indicators, obtained from
parsing SQL statements, detail the way relations are used in data access and
manipulation. Procedural indicators are obtained from the host language code
analysis and identify data and control patterns.



2.2 Our Approach 

The reverse engineering approaches described above are useful; however, they
do not fully take into account all the constraints inherent in the relationships
between the sets of tuples in the relations. As a result, only a few
object-oriented concepts can be identified. All the reverse engineering approaches
use key analysis techniques (on both the primary and foreign keys) to elicit the
semantics between relations. But they assume that the sets of tuples in the
relations are restricted to the tuple equality condition, and therefore they
cover only a subset of the constraints between the tuples of relations. However, in
analysing the relations, additional semantics can be uncovered by examining
the sets of tuples when their intersection is partial (tuple overlap) and when
there is no intersection (tuple disjunction). Formally, given two relations,
say a and b with $K_a$ and $K_b$ as their corresponding primary keys, we say that
there exists:

• tuple equality between the tuples of the relations a and b if and only if
$a[K_b] \subseteq b[K_b]$ or vice versa, i.e. $b[K_a] \subseteq a[K_a]$;

• tuple overlap between the tuples of the relations a and b if and only if
$a[K_b] \cap b[K_b] \neq \emptyset$, $a[K_b] - b[K_b] \neq \emptyset$, and $b[K_b] - a[K_b] \neq \emptyset$, or vice versa,
i.e. $a[K_a] \cap b[K_a] \neq \emptyset$, $b[K_a] - a[K_a] \neq \emptyset$, and $a[K_a] - b[K_a] \neq \emptyset$; and

• tuple disjunction between the tuples of the relations a and b if and only
if $a[K_b] \cap b[K_b] = \emptyset$ or vice versa, i.e. $a[K_a] \cap b[K_a] = \emptyset$.
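
The three conditions above can be illustrated with plain Python sets. This is only a sketch: the function name tuple_correlation is hypothetical, the two arguments stand for the projections of a and b on the same key attributes, and inclusion in either direction is treated as equality, which is one reading of the "or vice versa" clause.

```python
def tuple_correlation(a_values, b_values):
    """Classify the correlation between two sets of key values.

    a_values and b_values stand for the projections of relations a and b
    on the same key attributes, e.g. a[K_b] and b[K_b].
    """
    a_set, b_set = set(a_values), set(b_values)
    # Inclusion in either direction is read as tuple equality.
    if a_set <= b_set or b_set <= a_set:
        return "equality"
    # Neither set contains the other, so both differences are non-empty;
    # a shared value then means overlap, no shared value disjunction.
    if a_set & b_set:
        return "overlap"
    return "disjunction"


print(tuple_correlation({1, 2}, {1, 2, 3}))     # equality
print(tuple_correlation({1, 2, 4}, {1, 2, 3}))  # overlap
print(tuple_correlation({4, 5}, {1, 2, 3}))     # disjunction
```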

The DOK reverse engineering methodology makes no assumption on the 
state of the tuples with respect to the referential integrity constraints and 
analyses the relations, attributes and relation keys from the three dimensions 
of equality, overlap and disjunction to extract most of the semantics of re- 
lational schemata. The methodology consists of two steps: classification and 
translation. During the classification stage, relations that serve as core classes 
for the building of the entire object-oriented schema are identified. Such rela- 
tions are called base relations. Binary relationships between base classes are 
also identified (called dependent relations) and finally composite relations are 
derived.

Before any classification of the relations, the relational schema must be
normalised to produce relations in 3NF. There are many algorithms which
can be used to generate such relations, e.g. (Britton et al. 1989).
Thus the normalisation issue will not be addressed in this paper. The
main reason for starting the reengineering process with 3NF relations is
that 3NF relations are the “best” structures for reflecting
the concepts of object-oriented data models. Further normalised relations,
such as BCNF and 4NF, “break” relations to the point of losing the original
structures of objects. Conversely, less normalised relations, such as 1NF and
2NF, may leave relations containing many objects which are difficult to
“separate” during the reengineering process.

When the relational schema is normalised, we then classify relations into
the three types previously described. The translation stage maps base
relations into their corresponding core classes. Later, dependent and composite
relations are translated according to (i) the appropriate dimensions (equality,
overlap, and disjunction) and (ii) the degree of correlation between their
primary and external keys. The former allows the identification of implicit
information embedded within a relational database, whereas the latter
derives the type of relationship (stronger or weaker) according to the degree of
inter-dependency between classes. This inter-dependency is measured by the
following four cases:

• (Case 1) $PK \cap EK = \emptyset$: the external key and the primary key do not share
any attributes;

• (Case 2) $PK \supset EK$: the external key is part of the primary key;

• (Case 3) $PK = EK$: the external key is the primary key; and

• (Case 4) $PK \cap EK \neq \emptyset$, $PK - EK \neq \emptyset$ and $EK - PK \neq \emptyset$: the external key
and the primary key share common attributes.



PK and EK denote the primary key and the external key respectively. As we
will see later, an external key of a relation, say R, is a set of R’s attributes
which is the primary key of another relation, say R', although the corresponding
inclusion dependency does not necessarily hold. If the referential integrity
constraint holds, that is $R[EK] \subseteq R'[EK]$, then the external key is called a foreign key.
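
A small Python sketch of the four cases and of the foreign key test follows. It is illustrative only: key_correlation and is_foreign_key are hypothetical names, keys are modelled as sets of attribute names, and the attribute names in the usage lines are invented for the example.

```python
def key_correlation(pk, ek):
    """Return the case number (1 to 4) relating a primary and an external key."""
    pk, ek = frozenset(pk), frozenset(ek)
    if not pk or not ek:
        raise ValueError("both keys need at least one attribute")
    if pk.isdisjoint(ek):
        return 1  # Case 1: no shared attributes
    if pk == ek:
        return 3  # Case 3: the external key is the primary key
    if ek < pk:
        return 2  # Case 2: the external key is a proper part of the primary key
    if ek - pk and pk - ek:
        return 4  # Case 4: partial overlap in both directions
    # Remaining situation: the primary key lies strictly inside the external key.
    raise ValueError("not one of the four cases")


def is_foreign_key(r_values, r_prime_values):
    """An external key is a foreign key when R[EK] is included in R'[EK]."""
    return set(r_values) <= set(r_prime_values)


print(key_correlation({"hospital_id"}, {"doctor_id"}))               # 1
print(key_correlation({"doctor_id", "hospital_id"}, {"doctor_id"}))  # 2
print(key_correlation({"hospital_id"}, {"hospital_id"}))             # 3
print(key_correlation({"hospital_id", "phone_no"},
                      {"hospital_id", "ward_id"}))                   # 4
print(is_foreign_key({1, 2}, {1, 2, 3}))                             # True
```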

Finally, there are major differences between DOK and the existing reverse
engineering approaches. Compared with the schema-based approaches, DOK deals
not only with one dimension, that is the equality dimension, but also with the
overlap and disjunction dimensions. Compared with the query-based approaches,
where the recovery of the different joins between relations is performed by
analysing the user’s queries, DOK analyses the correlation of the stored data
and therefore recovers a wider set of joins, which are not automatically
discovered from the query results. Compared with Rumbaugh’s approach, DOK has a
limited perspective with regard to dependency constraints. This is mainly
because Rumbaugh’s approach derives a variety of constraints from the stored
data, whereas the DOK approach assumes that these constraints are available
within the relational databases. However, from the analysis perspective, the
DOK approach has substantial advantages because it deals with both key and
data correlations.



3 THE DOK REFERENCE MODEL 

This section describes the DOK model used in the paper to support our reengi- 
neering approach. The aim of this reengineering approach is to generate DOK 
schemata, defined using the DOK reference model, based on the information 
contained in local databases. The assumption made in the DOK project is 
that the local database systems support relational and object-oriented data 
models only. In this paper, we provide only an overview of the basics of
the DOK distributed object model. For more details, the reader may refer to (Tari
et al 1996).

As pointed out at the beginning of this paper, the aim of the DOK project 
is to design and implement a cooperative database system to enable effi- 
cient communication and computation across different database platforms. 
The DOK system, as shown in Figure 1, involves a set of managers which
oversee the smooth running of a federated system and are responsible for
ensuring the operational requirements of a federation. Users interact with a
federation through the local external schema of one of the component databases,
implemented in the local wrapper. Users’ requests involving remote data are
analysed by the local wrapper and redirected to the DOK Manager, which
has to ensure proper transaction, concurrency control and query management.

The wrappers play an important role in the DOK environment. They are re- 
sponsible for the translation of local database schemata and provide advanced 
functions of negotiation and communication allowing the DOK system to un- 
derstand the semantics embedded in local database applications, to identify 
potential systems to perform specific tasks, and to negotiate the execution 
of the tasks. Also the wrappers are in charge of enforcing different levels of 
autonomy, thus allowing a customised process of cooperation. 

Distributed applications are designed around the concept of a virtual object 
which describes an abstraction defined from various data stores* located in 
different databases. Specifically, a virtual object is defined as a set of attribute 
“references” which point to already defined objects in local databases, called 
physical objects. The main difference between virtual and physical objects
is that the former are not physically stored in the local databases; rather,
they correspond to virtual representations of global abstractions used by
distributed applications.

*In this paper, we refer to a data store as either a tuple, an object, a record or any data
contained in a local database.



Figure 1 The DOK Physical Architecture



DOK schemata are defined as a set of virtual objects with a set of
relationships between these objects. Relationships between objects are typically
extensions of well-known aggregation and inheritance relationships to deal
with heterogeneous data. We can imagine a DOK federated schema modelling
a University application, where for example we need to record information
about different departments and their staff members, details of students,
including their personal information, pictures, and final marks, etc. In a
simplified way, we will need to define three virtual objects: Student, Staff and
Department. Each of these virtual objects is defined by a set of attributes
which are defined by “picking up” information from three databases: a personal
database (pDB - which stores information about staff members of the different
departments of a given university), a student database (stDB - which stores
information about students and their results), and a bitmap database (bitDB
- which stores pictures of both staff and students of different departments).
The virtual object Department is built by references to information located
in the databases pDB and bitDB. In a similar way, the virtual object Student
contains three types of information: Looks-like (which refers to a picture in
bitDB), Personal-information (which refers to a view of stDB) and Results
(which is a SQL query on stDB constructing the results of a student).
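
To make the notion of a virtual object more concrete, the Student example might be modelled as below. This is a hedged sketch, not the DOK object model: the classes Reference and VirtualObject, their fields, and the target strings are all hypothetical, and the SQL text is deliberately left as a placeholder.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Reference:
    """Points at data held in a local database (hypothetical structure)."""
    database: str  # e.g. "stDB" or "bitDB"
    kind: str      # e.g. "picture", "view" or "sql"
    target: str    # name or query text identifying the physical data


@dataclass
class VirtualObject:
    """A named set of attribute references; no data is stored in the object."""
    name: str
    attributes: dict = field(default_factory=dict)

    def databases(self):
        """Return the local databases this virtual object draws on."""
        return sorted({ref.database for ref in self.attributes.values()})


student = VirtualObject(
    "Student",
    {
        "Looks-like": Reference("bitDB", "picture", "student_picture"),
        "Personal-information": Reference("stDB", "view", "student_details"),
        "Results": Reference("stDB", "sql", "SELECT ... FROM ..."),
    },
)
print(student.databases())  # ['bitDB', 'stDB']
```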



4 THE DOK REE APPROACH: AN OVERVIEW 

This section outlines the main principles of the DOK reengineering approach. 
The remaining sections provide details of the different steps required to
generate object-oriented schemata from relational databases. The DOK
reengineering approach differs from existing approaches with regard to the
identification of different object-oriented relationships. The identification of
such relationships is crucial in the recovery of object-oriented specifications
because they can be used to improve the “quality” of the final object-oriented
design (Tari et al 1997).

In most approaches for database reverse engineering, the first step consists 
of recovering the core classes of the final object-oriented schema. These ‘struc- 
tures’ are then re-used through different relationships to define more complex 
structures. These core classes are identified by looking for a specific type
of relation, called base relations, whose distinguishing characteristic is
that they are not logically dependent on the remaining relations of a relational
schema. The process of identifying and translating base relations is relatively
straightforward: there is a one-to-one mapping between base relations
and core classes, which poses no great difficulties in the translation from
a relational schema to an object-oriented schema.

After base classes are identified, the next step, which is the crucial step of 
any reengineering process, is to identify the different relationships between 
classes. These classes include those already identified at the beginning of the 
reengineering process, as well as those which are derived during this process, 
such as dependent classes (classes that are existentially dependent on others). 
Our approach for the identification of the different types of relationships
between classes is based on (i) the analysis of relation keys defined within the
relational schemata and (ii) the analysis of constraints defined explicitly in
referential integrity constraints and implicitly in data sources. This gives a
reengineering methodology based on the analysis of the semantics reflected
by the specification of the relational schemata and on the semantics provided
within the data sources.



1. In analysing the different keys of a relational schema, the focus of our
approach is on the correlation between the primary keys and external keys.
The former are well-known concepts and therefore are not explained in
this paper. The latter is a category of key constraints which “relax” one
of the conditions related to foreign keys. Informally speaking, an external
key of a relation represents a set of attributes which is a primary
key of another relation, with no condition imposed on the
tuple inclusion between the data stores of the two relations. A simple
example of an external key is the attribute student_id of the relation
Address(student_id, zip, street, town), where the data stores related to
student_id in this relation, i.e. Address[student_id], are not a subset of those
of the original relation Student(student_id, name, age):
$Address[student\_id] \not\subseteq Student[student\_id]$.

This new concept of an external key introduces more realistic requirements
when dealing with the reengineering of relational databases, as in
most database applications the key constraints do not naturally reflect
the assumptions of the Universe of Discourse (UoD), as shown in the above
example for student_id.

The analysis of the semantics reflected within a relational schema is based
on the correlation between primary keys and external keys, enabling the
identification of a larger number of hidden relationships between classes
than in existing reengineering approaches. Four cases of key correlation
are covered in our analysis: (1) the external key is a non-key attribute, (2)
the external key is a component of a composite primary key, (3) the external
key is the primary key, and finally (4) the external key is partially used as
a component of a composite primary key. These four cases are graphically
shown in Figure 2, where PK denotes a primary key and EK denotes an
external key.

2. The other major issue in the reengineering of relational databases is the analysis
of the data sources to extract additional semantics which can be used to
derive hidden object-oriented structures and relationships. All the existing
reengineering approaches assume strict adherence to the referential integrity
constraints defined on relational schemata, thus restricting the values
of data stores in the relations. This cannot be guaranteed, so the degree to
which the data sources of two relations intersect (i.e. adhere to the referential
integrity constraints) identifies the semantics of the relationship between
the relations. The three dimensions for the intersection of the sets of data
sources, called tuple inclusion, are equality, overlap and disjunction. Figure 3
shows graphically the three different tuple inclusions, where: (1) equality is
defined as $R_1[A] \subseteq R_2[B]$ (or vice versa), (2) overlap is defined as
$R_1[A] - R_2[B] \neq \emptyset$, $R_2[B] - R_1[A] \neq \emptyset$ and $R_1[A] \cap R_2[B] \neq \emptyset$,
and (3) disjunction is defined as $R_1[A] \cap R_2[B] = \emptyset$.

Figure 2 The Four Cases of Key Correlation

Figure 3 The Three Cases of Data Correlation



The combination of the key correlation (of Figure 2) and data correlation (of
Figure 3) in the reengineering process induces twelve (12) cases, from which
we obtain both the explicit and implicit semantics of relational databases
that can be explicitly specified within object-oriented schemata. Each of the
cases yields different object-oriented constructs from the relational
schema, as shown in Figure 4.
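
The twelve cases are simply the Cartesian product of the four key correlations and the three data correlations, as the short Python sketch below enumerates. The labels and the numbering are assumptions made for illustration and need not match the layout of Figure 4; the constructs recovered in each case are not filled in here.

```python
from itertools import product

# Labels paraphrase the cases described in the text; wording is illustrative.
KEY_CORRELATIONS = (
    "external key is a non-key attribute",
    "external key is part of the primary key",
    "external key is the primary key",
    "external key partially overlaps the primary key",
)
DATA_CORRELATIONS = ("equality", "overlap", "disjunction")

cases = list(product(KEY_CORRELATIONS, DATA_CORRELATIONS))
for number, (key_case, data_case) in enumerate(cases, start=1):
    print(f"{number:2d}. {key_case} / {data_case}")
```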

In describing our reverse engineering approach, we first discuss the analysis
of the relational keys in a relational schema and then the analysis of referential
integrity constraints using data stores of relational databases. We illustrate
our methodology by using the relational database schema of a Hospital
application given below. Note that primary keys are denoted by capital
letters. Foreign and external keys are not specified in the Hospital schema;
they will be discussed in the different scenarios related to the 12 cases.
The reader may notice that the equality dimension, as shown in Figure 4,
is the situation where external keys are in fact foreign keys. The remaining
dimensions, that is the overlap and disjunction dimensions, are concerned with
external keys which are not foreign keys.



Patient(PATIENT_ID, patient_name, address, symptoms) 
Doctor(DOCTOR_ID, doctor_name, salary, specialisation) 
Ward(WARD_ID, title) 
Hospital(HOSPITAL_ID, hospital_name) 
Laboratory(LABORATORY_ID, laboratory_name, lab_no) 
Phone(HOSPITAL_ID, PHONE_NO) 
Handled(PATIENT_ID, DOCTOR_ID) 
Registered(PATIENT_ID, WARD_ID) 
Staff(DOCTOR_ID, HOSPITAL_ID) 
Head(HOSPITAL_ID, doctor_id) 
Resident-In(WARD_ID, DOCTOR_ID) 
Facilities(HOSPITAL_ID, LABORATORY_ID) 
Address(HOSPITAL_ID, PHONE_NO, code, zip, town) 

Figure 4 Summary of Object Recovery based on Key & Data Correlations 



5 OBJECT RECOVERY 

The recovery of objects from relational databases is a two-step process, where 
the core classes are initially extracted from the original database schema and 
define the core classes of the target object-oriented schema. Since these core 
classes are just skeletons of the target schema, additional classes and relation- 
ships are to be recovered from the relational database by analysing both the 
information provided in the schema (that is according to the key correlation 
between the relations of the schemata) and the information provided by the 
data stores (that is according to data correlation of data stores). 

The first step of this recovery procedure is based on the classification of 
the different relations of a relational schema to reflect the different 
object-oriented constructs. Our approach distinguishes between three types of 
relations: base relations, dependent relations and composite relations. Base 
relations define the core classes of object-oriented schemata, whereas 
dependent and composite relations are used to recover the different 
relationships between classes. This section provides the following: (i) the 
classification rules used to partition a relational schema so that it 
reflects the different object-oriented constructs, and (ii) the mapping rules 
which recover the hidden semantics in the schema and data stores. In some of 
the 12 cases, we illustrate the implementation of the generated 
object-oriented schema using ObjectStore (ObjectStore 1996). 



5.1 Class Recovery 

Here we describe the basic techniques to recover classes from a relational 
database. Details of the corresponding algorithms can be found in one of our 
earlier works on database reengineering (Tari et al. 1997). 

(a) Principle 

We use the key constraints of relational schemata as a basis for the 
identification of classes. For example, the attribute DOCTOR_ID of the 
relation Doctor shows the existence of a class Doctor, since this attribute 
is the primary key of the relation Doctor. In addition, this relation is not 
composed from any other keys, which means that it can be used as a core 
relation. Relations exhibiting this property are called base relations, and 
their corresponding classes are base classes. Since base classes are not 
composed from any other information, they will be used as the foundation on 
which more complex classes are built using inheritance, aggregation or 
association. 

Let us now consider the relation Phone. This relation contains only one 
reference to another relation, Hospital, which is $K_{Hospital}$ = 
HOSPITAL_ID. This reference shows that the class Phone will be dependent on 
the base class Hospital. If there exists a referential integrity constraint 
between the relation Phone and the relation Hospital, i.e. the equality 
dimension (Figure 4, Case 2), the class Hospital will contain an attribute 
that references Phone. This is consistent with the inclusion dependency 
assumed above, which states that the information of Phone can be accessed 
only if the hospital information is known. However, if the analysis of the 
relations is performed according to the overlap dimension, that is, 
$K_{Hospital}$ in the relation Phone is an external key*, then the 
relationship between the classes is not an aggregation, because some data 
stores of the relation Phone cannot be accessed from the relation Hospital. 

The relations that exhibit the same property as the relation Phone are 
called dependent relations and their corresponding object-oriented classes 
are called dependent classes which are completed with aggregation and 
inheritance relationships. 

However, the above situation, in which a primary key may be composed of an 
external key, can be more complex. Indeed, we may have a set of relations 
$R_1, \ldots, R_n$, say $n > 2$, where there exists an inclusion between the 
keys of each consecutive pair of relations, i.e. 
$K_{R_i} \subseteq K_{R_{i+1}}$, $1 \le i \le n-1$. This also means that the 
relation $R_i$, $2 \le i \le n$, is composed of a set of external keys 
$K_{R_1}, K_{R_2}, \ldots, K_{R_{i-1}}$. The corresponding class of the 
relation $R_i$ will be dependent on the class of $R_{i-1}$, which itself 
depends on that of $R_{i-2}$, and so on. This situation will induce a nesting 
of aggregation relationships between classes that are derived from the 
inclusion amongst the primary keys of their corresponding relations. The 
relations $R_2, \ldots, R_n$ that exhibit this property are also called 
dependent relations. The relation Address is a good example of a dependent 
relation which contains a nesting of keys. Indeed, this relation has two 
external keys, $K_{Hospital}$ = HOSPITAL_ID and $K_{Phone}$ = (HOSPITAL_ID, 
PHONE_NO), and furthermore $K_{Address}$ = (HOSPITAL_ID, PHONE_NO). If, in 
addition, we also have the properties (i) Phone[$K_{Hospital}$] $\subseteq$ 
Hospital[$K_{Hospital}$] and (ii) Address[$K_{Phone}$] $\subseteq$ 
Phone[$K_{Phone}$] (i.e. we are dealing with the equality dimension), we 
derive an aggregation relationship between Hospital and Phone, and an 
additional aggregation relationship between the class Phone and the class 
Address. The properties (i) and (ii) above create a nesting of aggregations 
from the class Hospital to the class Address via the class Phone. 

*Phone[$K_{Hospital}$] $-$ Hospital[$K_{Hospital}$] $\neq \emptyset$, 
Hospital[$K_{Hospital}$] $-$ Phone[$K_{Hospital}$] $\neq \emptyset$, and 
Hospital[$K_{Hospital}$] $\cap$ Phone[$K_{Hospital}$] $\neq \emptyset$. 

The last category of relations is one where there is more than one reference 
to other relations and, furthermore, there is no inclusion between the 
external keys (as there is for dependent relations). These relations are 
called composite relations, and they generally express either associations 
or multiple inheritance between classes. For instance, the relation 
Resident-In is an example of a composite relation, which specifies a 
relationship between two independent chunks of information, i.e. Doctor and 
Ward. Since this relation has no attributes other than the external keys, it 
will be transformed into an association between the classes Doctor and Ward. 

The result of relation classification on the Medical schema is given below. 
As mentioned earlier, base relations are used to generate the core classes of 
the target object-oriented schema, whereas dependent and composite relations 
make it possible to recover additional classes and relationships, and 
therefore to complete the design of the schema. 

• Base relations: Patient, Doctor, Ward, Hospital, and Laboratory. 

• Dependent relations: Phone and Address. 

• Composite relations: Handled, Registered, Staff, Head, Resident-In, and 
Facilities. 
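
The classification above can be sketched in a few lines of standard C++. This is an illustrative sketch under stated assumptions rather than the algorithm of (Tari et al. 1997): the types Key, Relation and RelationKind and the function classifyRelation are hypothetical names, the external keys of each relation are assumed to be already known, and a relation whose external keys are only partly nested is treated as composite.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

using Key = std::set<std::string>;  // a key is a set of attribute names

struct Relation {
    std::string name;
    Key primaryKey;
    std::vector<Key> externalKeys;  // keys of other relations found in this relation
};

enum class RelationKind { Base, Dependent, Composite };

// True when every attribute of `inner` also belongs to `outer`.
bool isSubset(const Key& inner, const Key& outer) {
    return std::includes(outer.begin(), outer.end(), inner.begin(), inner.end());
}

RelationKind classifyRelation(const Relation& r) {
    assert(!r.primaryKey.empty());
    // No reference to any other relation: base relation.
    if (r.externalKeys.empty()) return RelationKind::Base;
    // Order the external keys by size; nested keys then form a chain in which
    // each key is included in the next one.
    std::vector<Key> keys = r.externalKeys;
    std::sort(keys.begin(), keys.end(),
              [](const Key& a, const Key& b) { return a.size() < b.size(); });
    for (std::size_t i = 0; i + 1 < keys.size(); ++i) {
        // Two external keys without inclusion: composite relation (assumption:
        // a partly nested set of keys is also treated as composite).
        if (!isSubset(keys[i], keys[i + 1])) return RelationKind::Composite;
    }
    // A single external key, or a chain of nested external keys.
    return RelationKind::Dependent;
}
```

With the Hospital schema, Phone (one external key) and Address (nested external keys) come out as dependent, Resident-In as composite, and Doctor as base, in agreement with the list above.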



(b) ObjectStore Implementation 

After the relations of a relational schema are classified, the next step in 
the process of translation is to generate the core classes of the 
object-oriented schema. These classes are directly derived from base 
relations and have the same attributes as those contained in the base 
relations. For example, from the relations Patient, Doctor, Ward, Hospital, 
and Laboratory of the Medical schema 

Patient(PATIENT_ID, patient_name, address, symptoms) 
Doctor(DOCTOR_ID, doctor_name, salary, specialisation) 
Ward(WARD_ID, title) 
Hospital(HOSPITAL_ID, hospital_name) 
Laboratory(LABORATORY_ID, laboratory_name, lab_no) 



the following ObjectStore schema is generated accordingly: 



/* schema.cc file */ 

#include <ostore/ostore.hh> 
#include <ostore/coll.hh> 
#include <ostore/manschem.hh> 
#include "Patient.hh" 
#include "Doctor.hh" 
#include "Ward.hh" 
#include "Hospital.hh" 
#include "Laboratory.hh" 

/* The schema marks must appear inside a function body */ 
void dummy() { 
    /* base classes */ 
    OS_MARK_SCHEMA_TYPE(Patient); 
    OS_MARK_SCHEMA_TYPE(Doctor); 
    OS_MARK_SCHEMA_TYPE(Ward); 
    OS_MARK_SCHEMA_TYPE(Hospital); 
    OS_MARK_SCHEMA_TYPE(Laboratory); 

    /* persistent set structures of the base classes */ 
    OS_MARK_SCHEMA_TYPE(os_Set<Patient>); 
    OS_MARK_SCHEMA_TYPE(os_Set<Doctor>); 
    OS_MARK_SCHEMA_TYPE(os_Set<Ward>); 
    OS_MARK_SCHEMA_TYPE(os_Set<Hospital>); 
    OS_MARK_SCHEMA_TYPE(os_Set<Laboratory>); 
} 



The specification of OS_MARK_SCHEMA_TYPE(os_Set<X>) in the schema 
introduces a persistent structure for the objects of the class X. This set struc- 
ture is particularly important for the base classes because most of the queries 
on the target object-oriented schema will have an “entry point” as a base class 
and will later use the different (access) paths to navigate through other classes 
of the object-oriented schema. Our claim is that all base classes must have a 
set persistent data structure (for the reason given above) and, probably, the 
database administrator can introduce additional set structures for the classes 
to be recovered in later stages of the reengineering process. 

In addition to the generated file schema.cc, other files are generated to 
provide the specifications of the base classes Patient, Doctor, Ward, 
Hospital, and Laboratory. These classes are ordinary C++ classes which have 
the same attributes as their parent relations. Owing to the limited size of 
the paper, we give only the header file of the class Laboratory. 

/* File Laboratory.hh */ 
#include <string.h> 
#include <ostore/ostore.hh> 
#include <ostore/coll.hh> 

class Laboratory; 
extern os_Set<Laboratory*> *laboratory_extent; 

class Laboratory 
{ 
protected: 
    int laboratory_id; 
    char* laboratory_name; 
    int lab_no; 
public: 
    Laboratory(int aa, char* bb, int cc) 
    { 
        laboratory_id = aa; 
        lab_no = cc; 
        /* allocate the name in the same segment as the object itself */ 
        int length = strlen(bb) + 1; 
        laboratory_name = new(os_segment::of(this), 
                              os_typespec::get_char(), length) char[length]; 
        strcpy(laboratory_name, bb); 
        /* insertion of the object in the set */ 
        laboratory_extent->insert(this); 
    } 
    /* deletion of the object from the set */ 
    ~Laboratory() { laboratory_extent->remove(this); } 
    static os_typespec* get_os_typespec(); 
}; 
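
The extent pattern used in Laboratory.hh (every object registers itself in a class-wide set on construction and removes itself on destruction) can be tried without ObjectStore. The sketch below is an assumption-laden stand-in: it replaces the persistent os_Set with a transient std::set, uses std::string instead of segment-allocated character arrays, and the accessor names are hypothetical.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>

class Laboratory {
public:
    Laboratory(int id, std::string name, int labNo)
        : laboratory_id_(id), laboratory_name_(std::move(name)), lab_no_(labNo) {
        extent().insert(this);  // insertion of the object in the set
    }
    ~Laboratory() { extent().erase(this); }  // deletion of the object from the set

    // Copying is disabled so that every live object is registered exactly once.
    Laboratory(const Laboratory&) = delete;
    Laboratory& operator=(const Laboratory&) = delete;

    int id() const { return laboratory_id_; }
    const std::string& name() const { return laboratory_name_; }
    int labNo() const { return lab_no_; }

    // Transient stand-in for the persistent os_Set<Laboratory*> extent.
    static std::set<Laboratory*>& extent() {
        static std::set<Laboratory*> instances;
        return instances;
    }

private:
    int laboratory_id_;
    std::string laboratory_name_;
    int lab_no_;
};
```

A query whose entry point is the base class Laboratory would start by iterating over Laboratory::extent(), which is the role the set structure plays in the generated schema.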



5.2 Relationship Recovery 

By examining keys, and in particular the correlation between the primary and 
external keys of the relations, we are able to elicit the semantics of the 
relationships between relations. Here we examine each of the four cases of 
key correlation of Figure 2 in detail by looking at the different situations 
of data correlation of Figure 3. 

(a) External Key - Non Key Attribute (EK-NKA) 

The case where the external key in a relation is a non-key attribute 
illustrates a weak relationship between classes, which is mainly 
characterised by the fact that the external key has no common attributes with 
the primary key. Indeed, if a relation $R_1$ is a dependent or composite 
relation and $K_{R_2}$ is an external key of $R_1$, and furthermore 
$K_{R_1} \cap K_{R_2} = \emptyset$, then the relationship between the classes 
$C_{R_1}$ and $C_{R_2}$ cannot be an aggregation, because the identification 
of the data stores of the relation $R_1$ does not use any information of the 
external key $K_{R_2}$. We now examine this case individually with the 
different dimensions of tuple inclusion as illustrated in Figure 4. 



(a-1) Equality dimension 

Figure 5 Object Recovery in the Case “EK-NKA”: (a) Equality 

When tuple inclusion is complete, the relationship is an association, meaning 
that the classes are weakly related. It is an association (instead of an 
aggregation) because the external key and the primary key share no common 
attributes (i.e. EK-NKA); the data stores of the corresponding classes are 
therefore weakly related. A good example of this scenario is the following 
situation, where the relations Doctor and Hospital, defined as 

Hospital(HOSPITAL_ID, hospital_name) 
Doctor(DOCTOR_ID, doctor_name, salary, specialisation, hospital_id) 

have the following inclusion dependency: 

Doctor[hospital_id] $\subseteq$ Hospital[HOSPITAL_ID] 



In this instance, the dependent relation is mapped to a class with an 
association relationship between this class and the class of the base 
relation. This relationship is illustrated in Figure 5(a). The external key 
hospital_id in the dependent relation Doctor, which in this case is also a 
foreign key because we are dealing with the equality dimension, simulates an 
association between the two relations. It has no common attributes with the 
primary key of the relation Doctor; therefore the recovered relationship 
between the classes Doctor and Hospital can only be a weak relationship, 
that is, an association. 

When the relationship is recovered, the original C++ classes Hospital and 
Doctor are updated to include the new association. 



/* File Hospital.hh */ 
class Doctor;   /* forward declaration for the association */ 
class Hospital; 
extern os_Set<Hospital*> *hospital_extent; 
class Hospital 
{ 
protected: 
    int hospital_id; 
    char* hospital_name; 
    os_Set<Doctor*> doctors; 
public: 
    Hospital(int aa, char* bb) 
    { /* constructor */ } 
    ~Hospital() { hospital_extent->remove(this); } 
    static os_typespec* get_os_typespec(); 
}; 

/* File Doctor.hh */ 
class Hospital;   /* forward declaration for the association */ 
class Doctor; 
extern os_Set<Doctor*> *doctor_extent; 
class Doctor 
{ 
protected: 
    char* doctor_name; 
    int salary; 
    char* specialisation; 
    os_Set<Hospital*> hospitals; 
public: 
    Doctor(char* aa, int bb, char* cc) 
    { /* constructor */ } 
    ~Doctor() { doctor_extent->remove(this); } 
    static os_typespec* get_os_typespec(); 
}; 



(a-2) Overlap dimension 

When there is only partial inclusion of data stores between the external key 
of one relation (first relation) and the primary key of another relation 
(second relation), the situation differs from the one previously presented 
(i.e. the equality dimension). Here we recover the hidden semantics as a 
combination of an association and an inheritance relationship. 

(1) The first relation is divided into two fragments, say $\alpha$ and 
$\pi$, to simulate the fact that only a part of its data stores have an 
intersection with the data stores of the second relation. We assume that 
$\alpha$ is the fragment that has a non-empty intersection. The fragment 
$\pi$ represents the hidden semantics which cannot be recovered if only the 
equality dimension is considered*. 

The mapping of the first relation will generate two classes related to the 
fragments $\alpha$ and $\pi$ respectively. Since both of these fragments 
materialise the same concept (or relation), an inheritance relationship will 
be recovered from the mapping of the first relation to semantically relate 
the class of $\alpha$ with the class of $\pi$. 

(2) From the second relation we will recover two types of information: (i) 
the class which implements this relation, and (ii) an association that 
relates its data stores to those of the class of $\alpha$. 

Using our Medical example, we consider the relations Doctor and Hospital, 
defined as 

Hospital(HOSPITAL_ID, hospital_name) 
Doctor(DOCTOR_ID, doctor_name, salary, specialisation, hospital_id) 

with the following properties of the overlap dimension: 

Doctor[hospital_id] $\cap$ Hospital[HOSPITAL_ID] $\neq \emptyset$ ($p_1$) 
Doctor[hospital_id] $-$ Hospital[HOSPITAL_ID] $\neq \emptyset$ ($p_2$) 
Hospital[HOSPITAL_ID] $-$ Doctor[hospital_id] $\neq \emptyset$ ($p_3$) 

*None of the existing reengineering approaches recovers the class of 
$\pi$, because they deal only with the equality dimension. 



The external key hospital_id in the relation Doctor simulates an association 
relationship between the data stores of the relation Hospital and some of the 
data stores of the relation Doctor. These data stores of the relation Doctor 
form a fragment of the relation which is mapped into a class called 
Hospital_Doctor. This class recovers those doctors who work in hospitals. The 
second fragment to be recovered relates to the data stores of the relation 
Doctor that do not have any relationship with those of the relation Hospital. 
For this fragment, we create a class that represents those doctors who do not 
work in hospitals. We denote this class by Doctor. Since the two classes 
Doctor and Hospital_Doctor relate to the same original relation, an 
inheritance relationship is recovered to simulate the fact that: (i) 
Hospital_Doctor contains information about doctors, and (ii) Doctor contains 
all information about doctors and, in particular, those data stores that do 
not have relationships with those of Hospital. The C++ class specifications 
of the new object-oriented mapping are: 



/* File Hospital.hh */ 
class Hospital_Doctor;   /* forward declaration for the association */ 
class Hospital; 
extern os_Set<Hospital*> *hospital_extent; 
class Hospital { 
protected: 
    int hospital_id; 
    char* hospital_name; 
    os_Set<Hospital_Doctor*> doctors; 
public: 
    Hospital(int aa, char* bb) 
    { /* constructor */ } 
    ~Hospital() { hospital_extent->remove(this); } 
    static os_typespec* get_os_typespec(); 
}; 

/* File Doctor.hh */ 
class Doctor; 
extern os_Set<Doctor*> *doctor_extent; 
class Doctor { 
protected: 
    char* doctor_name; 
    int salary; 
    char* specialisation; 
public: 
    Doctor(char* aa, int bb, char* cc) 
    { /* constructor */ } 
    ~Doctor() { doctor_extent->remove(this); } 
    static os_typespec* get_os_typespec(); 
}; 

/* Doctors who work in hospitals: subclass carrying the association */ 
class Hospital_Doctor : public Doctor { 
protected: 
    os_Set<Hospital*> hospitals; 
public: 
    /* constructor & destructor */ 
}; 
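
The fragmentation that underlies this mapping can be sketched as a simple partition of the Doctor tuples. The code below is only an illustration in standard C++ under stated assumptions: DoctorTuple, Fragments and splitDoctors are hypothetical names, keys are plain integers, and the Hospital relation is represented only by the set of its HOSPITAL_ID values.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

struct DoctorTuple {
    int doctor_id;
    std::string doctor_name;
    int hospital_id;  // external key; it may match no tuple of Hospital
};

struct Fragments {
    std::vector<DoctorTuple> hospitalDoctors;  // to be mapped to Hospital_Doctor
    std::vector<DoctorTuple> otherDoctors;     // to be mapped to plain Doctor objects
};

// Split the Doctor relation according to whether the external key value
// appears among the primary key values of Hospital.
Fragments splitDoctors(const std::vector<DoctorTuple>& doctors,
                       const std::set<int>& hospitalIds) {
    Fragments result;
    for (const DoctorTuple& d : doctors) {
        if (hospitalIds.count(d.hospital_id) > 0) {
            result.hospitalDoctors.push_back(d);
        } else {
            result.otherDoctors.push_back(d);
        }
    }
    return result;
}
```

Under the overlap dimension both fragments are non-empty; under the equality dimension the second fragment would be empty, which is why approaches restricted to equality never see it.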



(a-3) Disjunction dimension 

This dimension is not applicable to the case EK-NKA, since there is no 
situation in which tuple inclusion is entirely absent. 

(b) External Key - part Composite Key (EK-pCK) 

This second case is concerned with the situation where the external key of 
one relation (first relation) is a part of the primary key of another 
relation (second relation). Note that, in this context, the second relation 
cannot be a base relation because it contains the external key, and can 
therefore be only a dependent or a composite relation. When the second 
relation is a composite relation, the recovery of hidden semantics is quite 
complex, as detailed later. When the second relation is a dependent relation, 
the external key contained in its primary key enables the recovery of an 
aggregation relationship, because all the data stores of this second relation 
rely on the external key in order to be identified. 




(b-1) Equality dimension 

In this case, with complete tuple inclusion, both the first and the second 
relations are mapped into distinct classes. An aggregation relationship is 
recovered to link the class related to the first relation with the class 
which simulates the second relation. The direction of this containment 
relationship is from the first class to the second class. 

Let us again consider the Medical schema, restricted to the relations 
Hospital and Phone: 

Hospital(HOSPITAL_ID, hospital_name) 
Phone(HOSPITAL_ID, PHONE_NO) 

with the inclusion dependency 

Phone[HOSPITAL_ID] $\subseteq$ Hospital[HOSPITAL_ID] 

The external key HOSPITAL_ID in the relation Phone simulates an aggregation 
relationship between the classes Hospital and Phone, because there is a 
strict inclusion between the external key and the primary key of the relation 
Phone. To depict this relationship in the object-oriented data model, the 
dependent relation Phone is mapped to a class Phone, and an aggregation 
relationship, from the base class Hospital to Phone, is recovered. 
Figure 6(a) shows the result of the mapping of the relations Phone and 
Hospital; we have omitted the ObjectStore specifications of this mapping 
because of the limited size of the paper. 

In the above example, the relation which contains the external key is a 
dependent relation, and thus the recovery of the aggregation is easier 
because only one external key exists in the relation. This is not the case 
in a composite relation, where there will be multiple external keys which 
could simulate either a set of aggregation relationships, a set of 
inheritance relationships, or a combination of both. The difficulty lies in 
determining which one of the three cases applies to the composite relation. 
In our approach, this problem is solved as follows. When the corresponding 
classes of the external keys have a common superclass, then the composite 
relation definitely simulates a multiple inheritance. In the other cases, we 
assume that the composite relation contains only aggregations, which need to 
be verified by the database administrator. This verification is necessary 
because these aggregations could in reality be a combination of aggregation 
and inheritance. Therefore an automatic mapping of a composite relation 
cannot be generated. 
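
The decision rule for composite relations can be sketched as follows. This is a simplified illustration, not the published algorithm: the names SuperclassOf, CompositeMapping and mapComposite are hypothetical, each class is assumed to have at most one direct superclass, and only proper ancestors count as a common superclass.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Already recovered hierarchy: class name -> name of its direct superclass.
using SuperclassOf = std::map<std::string, std::string>;

enum class CompositeMapping { MultipleInheritance, AggregationsToVerify };

// Collect the proper ancestors of a class by following the superclass links.
std::set<std::string> ancestors(const SuperclassOf& hierarchy, const std::string& cls) {
    std::set<std::string> result;
    auto it = hierarchy.find(cls);
    // The insert check also stops the walk if the links happen to form a cycle.
    while (it != hierarchy.end() && result.insert(it->second).second) {
        it = hierarchy.find(it->second);
    }
    return result;
}

// `referenced` holds the classes of the external keys of the composite relation.
CompositeMapping mapComposite(const SuperclassOf& hierarchy,
                              const std::vector<std::string>& referenced) {
    assert(referenced.size() >= 2);
    std::set<std::string> common = ancestors(hierarchy, referenced[0]);
    for (std::size_t i = 1; i < referenced.size() && !common.empty(); ++i) {
        const std::set<std::string> next = ancestors(hierarchy, referenced[i]);
        std::set<std::string> kept;
        for (const std::string& c : common) {
            if (next.count(c) > 0) kept.insert(c);
        }
        common.swap(kept);
    }
    // A shared superclass signals multiple inheritance; otherwise the relation
    // is mapped to aggregations that the database administrator must verify.
    return common.empty() ? CompositeMapping::AggregationsToVerify
                          : CompositeMapping::MultipleInheritance;
}
```

For the relation Facilities, whose external keys refer to the base classes Hospital and Laboratory, no common superclass exists and the sketch returns AggregationsToVerify.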

There are two possible scenarios to the problem of differentiating between 
aggregations and combination of inheritance and aggregations. Here we briefly 
overview a possible solution for each of these scenarios. 



Recovering aggregations only: The composite relation is mapped into a 
set of aggregations which match the semantics of the relational schema. 
For example, from the following relations 

Facilities(HOSPITAL_ID, LABORATORY_ID) 
Hospital(HOSPITAL_ID, hospital_name) 
Laboratory(LABORATORY_ID, laboratory_name, lab_no) 

with the inclusion dependencies 

Facilities[HOSPITAL_ID] $\subseteq$ Hospital[HOSPITAL_ID] 
Facilities[LABORATORY_ID] $\subseteq$ Laboratory[LABORATORY_ID] 

we recover only aggregation relationships, which link the classes Hospital 
and Laboratory with the class Facilities. Our approach is consistent in that 
the result reflects the intended semantics. 

Recovering both aggregation and inheritance: Let us assume we extend 
the Medical schema and add a new relation AustralianPhone 

Hospital(HOSPITAL_ID, hospital_name) 
Phone(HOSPITAL_ID, PHONE_NO) 
AustralianPhone(PHONE_NO) 

with the following inclusion dependency 

Phone[PHONE_NO] $\subseteq$ AustralianPhone[PHONE_NO] 

The relation Phone now becomes a composite relation with HOSPITAL_ID 
and PHONE_NO as external keys. Hospital and AustralianPhone are base 
relations, and therefore their corresponding classes do not have any 
common superclass. In this case the semantics of the composite relation is 
intended to be a combination of an inheritance (between Phone and 
AustralianPhone) and an aggregation (between Phone and Hospital). It could 
also represent a set of aggregations between Phone and the classes 
AustralianPhone and Hospital. 

To simplify the translation process, our methodology assumes that all the 
recovered relationships are aggregations. These must later be refined by the 
database administrator to improve the design. As explained above, in some 
situations there is no way to differentiate between an aggregation and an 
inheritance relationship. Therefore, the role of the database administrator 
will be to "correct" the recovered semantics to reflect the correct 
interpretation of the UoD (Tari et al. 1997). 

We would like to point out that most of the existing reverse engineering 
approaches necessitate frequent involvement of the database administrator. 
In our approach, however, there exists only one case where the recovery 
situation is ambiguous and additional semantics therefore cannot be extracted 
from the database schema. In all the remaining cases, our approach provides 
a complete set of translation rules to recover the hidden semantics based on 
the two-dimensional correlation scheme presented above. 

(b-2) Overlap dimension 

To simplify the description of our reengineering approach, as in the previous 
case, we consider the second relation (that is, the one which contains the 
external key) as a dependent relation. Since we are dealing with the overlap 
dimension, only some of the data stores of the second relation are identified 
by the primary key of the first relation; therefore, the recovery of an 
aggregation will be applied only to one part of the data stores of the second 
relation. 

There are multiple solutions to map the first and second relations, 
depending on which point of view we take. Basically, the idea is to split 
either the first or the second relation in order to separate their common 
data stores from the rest. If the focus is on the first relation, then two 
classes are created, say a and b, where we assume that a does not have any 
common data stores with the second relation. Therefore, an aggregation is 
recovered to simulate the fact that the external key, which is now in the 
class b, is used to identify the data stores of the second relation. 
Finally, an additional class, say ab, is recovered to materialise the whole 
first relation, with a multiple inheritance relationship with the classes a 
and b. 

Alternatively, mapping the two relations consists of splitting the second 
relation instead. We assume that a and b are the two classes, where a does 
not have any common data stores with the first relation. In the same manner 
as the first alternative, we recover the following classes and relationships: (i) 
a, b, ab are classes simulating the first relation, (ii) an aggregation between b 
and the second class, and (iii) a multiple inheritance between ab and a and b. 

Let us consider the relations Laboratory and Phone in the Medical schema 

Laboratory(LABORATORY_ID, laboratory_name) 
Phone(LABORATORY_ID, PHONE_NO) 

with the following properties 

Phone[LABORATORY_ID] $\cap$ Laboratory[LABORATORY_ID] $\neq \emptyset$ ($p_1$) 
Phone[LABORATORY_ID] $-$ Laboratory[LABORATORY_ID] $\neq \emptyset$ ($p_2$) 
Laboratory[LABORATORY_ID] $-$ Phone[LABORATORY_ID] $\neq \emptyset$ ($p_3$) 

Examining these relations, we can see that, according to the property 
$p_1$, there are some laboratories with a phone but also, according to the 
property $p_3$, some laboratories without a phone. Additionally, the 
property $p_2$ implies that there must be some other rooms that are not 
laboratories which also have a phone. Therefore, by using the property 
$p_2$, we have discovered a new abstraction not specified (as a relation) 
in the relational schema. In the mapping process from the relational to an 
object-oriented schema, this new abstraction will be explicitly represented 
as a class. 

In the above example, we recover the classes Phone, Laboratory, 
Laboratory_with_Phone and a class Room_with_Phone (see Figure 6(b)). This 
last class has been recovered because of the overlap dimension. Existing 
reverse engineering approaches will not recover such a class because of the 
restriction made on the inclusion between data stores. Laboratory is a 
superclass of Laboratory_with_Phone, but Laboratory_with_Phone is also a 
subclass of Room_with_Phone, so there is a multiple inheritance 
relationship. The aggregation relationship is between the class 
Room_with_Phone and the class Phone. 

(b-3) Disjunction dimension 

As for the two previous cases, we assume that the second relation is a depen- 
dent relation. Due to the size limitations of the paper, we will not discuss the 
case where the second relation is a composite relation. 

Since we are dealing with the disjunction dimension, there are no common 
data stores between the two relations. Recall that the second relation 
contains an external key as a part of its primary key. The mapping of the 
two relations will recover the following object-oriented constructs: (i) 
two classes corresponding to the first and second relations, (ii) a new 
abstract class, say a, simulating the fact that there exists a relation, 
which has the same attributes as the first relation, which shares data 
stores with the second relation*, (iii) a new abstract class, say b, 
enabling the sharing of common characteristics between the first relation 
and the class a, and finally (iv) a multiple inheritance between the classes 
b and a and the class corresponding to the first relation. 

If we consider the relations Laboratory and Phone in the Medical schema 

Laboratory(LABORATORY_ID, laboratory_name) 
Phone(LABORATORY_ID, PHONE_NO) 

with the following property 

Phone[LABORATORY_ID] $\cap$ Laboratory[LABORATORY_ID] $= \emptyset$ ($p_1$) 

then, examining these relations with the disjunction dimension holding, we 
can see that there are no laboratories with a phone ($p_1$), but there are 
other rooms with phones, implied by the existence of the external key 
LABORATORY_ID in the relation Phone. Therefore, we create the classes Phone 
and Laboratory from the relations Phone and Laboratory respectively. Two new 
classes are recovered: the abstract classes Room and Room_with_Phone (see 
Figure 6(c)). Laboratory is a subclass of the abstract class Room, and 
Room_with_Phone is also a subclass of Room, so there are two simple 
inheritance relationships. The aggregation relationship is between the class 
Room_with_Phone and the class Phone. 

(c) External Key - Simple Key (EK-SK) 

This is an uncomplicated case where an external key is entirely composed of a 
simple key. This type of property simulates an inheritance (single or 
multiple) relationship between classes, where the type of inheritance is 
determined by the type of tuple inclusion between the relations. 

*This assumption is particularly useful in the context of the Open World 
Assumption (OWA), where the first relation may acquire new data stores which 
will be shared with the second relation. 




(c-1) Equality dimension 

In this instance, there is complete inclusion between the data stores of both 
relations where all the data stores in the dependent relation are included in 
the data stores of the base relation. From the dependent relation we recover 
a class and a simple inheritance relationship with the base class. 

If we consider the relations Employee and Doctor in the Medical schema 

Employee(EMPLOYEE_ID, employee_name) 
Doctor(DOCTOR_ID, salary, specialisation) 

with the inclusion dependency 

Doctor[DOCTOR_ID] $\subseteq$ Employee[EMPLOYEE_ID] 

then, examining these relations with the equality dimension holding, we can 
see that all the tuples of Doctor are employees of the hospital. To map this 
relationship, a class Doctor is created with an inheritance relationship 
between it and the base class Employee. Figure 7(a) shows such a mapping. 

(c-2) Overlap dimension 

In this instance, the external key is comprised entirely of a simple key and, 
additionally, some data stores in the dependent relation are not contained in 
the base class and some of the data stores in the base class are not 
contained in the dependent relation. This property simulates a multiple 
inheritance relationship between classes. From the two relations, we extract 
the following object-oriented constructs: (i) two classes which implement 
the two relations, and (ii) an additional class containing all the common 
data stores of the two relations. 

If we consider the relations Office and Laboratory in the Medical schema

Office (OFFICE_ID, location)

Laboratory (LABORATORY_ID, laboratory_name)

with the following properties

Laboratory[LABORATORY_ID] ∩ Office[OFFICE_ID] ≠ ∅ (p1)
Laboratory[LABORATORY_ID] − Office[OFFICE_ID] ≠ ∅ (p2)
Office[OFFICE_ID] − Laboratory[LABORATORY_ID] ≠ ∅ (p3)

then examining these relations with the overlap dimension holding, we can
see that some instances of Office are not included in Laboratory (p3) and
some instances of Laboratory are not included in Office (p2). Semantically,
this implies that there are rooms classified as offices, rooms classified as lab- 
oratories and additionally rooms that are both offices and laboratories. To 
map this relationship, we create the classes Laboratory and Office from the 
relations Laboratory and Office respectively with the additional new class 
OfficeLaboratory. The new recovered class OfficeLaboratory has a multiple 
inheritance relationship between it and the classes Laboratory and Office. 
Figure 7(b) summarises all the recovered objects and relationships from the 
relations Laboratory and Office. 
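
As a rough sketch of this mapping, the following Python fragment distributes the tuples of the two relations over the three recovered classes. The use of dataclasses, the function name recover_overlap and the assumption that a shared room carries the same key value in both relations are illustrative choices, not part of the approach itself.

```python
from dataclasses import dataclass


@dataclass
class Office:
    office_id: int
    location: str


@dataclass
class Laboratory:
    laboratory_id: int
    laboratory_name: str


@dataclass
class OfficeLaboratory(Office, Laboratory):
    """Rooms whose key value occurs in both relations (multiple inheritance)."""


def recover_overlap(offices, laboratories):
    """Create one object per key value, choosing the class by relation membership."""
    office_by_id = {row["OFFICE_ID"]: row for row in offices}
    lab_by_id = {row["LABORATORY_ID"]: row for row in laboratories}
    objects = []
    for key in sorted(office_by_id.keys() | lab_by_id.keys()):
        if key in office_by_id and key in lab_by_id:
            objects.append(OfficeLaboratory(
                office_id=key,
                location=office_by_id[key]["location"],
                laboratory_id=key,
                laboratory_name=lab_by_id[key]["laboratory_name"],
            ))
        elif key in office_by_id:
            objects.append(Office(key, office_by_id[key]["location"]))
        else:
            objects.append(Laboratory(key, lab_by_id[key]["laboratory_name"]))
    return objects
```

Keyword arguments are used for OfficeLaboratory because the field order of a dataclass with several bases follows the reversed method resolution order, which is easy to get wrong positionally.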

(c-3) Disjunction dimension 

In this dimension, we have the following property: the external key is com- 
prised entirely of a simple key, the data stores of the dependent relation are 
not contained in the base class, and the data stores of the base class are not 
contained in the dependent relation. This property simulates multiple simple
inheritance relationships between classes. For the two relations, we recover the
following object-oriented constructs: (i) a class that simulates the dependent 
relation, that is the relation which contains the external key, (ii) a class which 
simulates the base relation, (iii) a new abstract class containing all the data 
stores that are common to the dependent relation and base class, and finally 
(iv) two inheritance relationships to the new abstract class. 

If we consider the relations Office and Laboratory in the Medical schema 

Office (OFFICE_ID, location)

Laboratory (LABORATORY_ID, laboratory_name)

with the following property

Laboratory[LABORATORY_ID] ∩ Office[OFFICE_ID] = ∅ (p1)

then examining these relations with the disjunction dimension holding, we
can see that the instances of Office are not included in Laboratory (p1) and
the instances of Laboratory are not included in Office (p1). Semantically,
we can see that there is a third unknown class from which both the classes
Office and Laboratory inherit. To map this relationship, we create the classes
Laboratory and Office from the relations Laboratory and Office respectively. A
new superclass, say Room, is also recovered with inheritance relationships with
both classes Laboratory and Office. Figure 7(c) summarises the different
concepts recovered during the mapping of the two relations Laboratory and
Office.
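
The three dimensions of case (c) differ only in how the key values of the two relations relate as sets, so the choice between them can be sketched as a small classification function. The function name and the extra "unclassified" outcome (base key values fully contained in the dependent ones, a situation the three dimensions do not describe) are assumptions made for this illustration.

```python
def classify_dimension(dependent_keys, base_keys):
    """Decide which dimension holds for the key values of two relations.

    equality    : every dependent key value also occurs in the base relation
    overlap     : shared values exist, and each side has values of its own
    disjunction : no key value is shared
    """
    dependent, base = set(dependent_keys), set(base_keys)
    if dependent <= base:
        return "equality"
    if not (dependent & base):
        return "disjunction"
    if base - dependent:
        return "overlap"
    return "unclassified"  # not covered by the three dimensions above


# Constructs recovered per dimension, paraphrasing the text of case (c).
RECOVERED = {
    "equality": "subclass with simple inheritance from the base class",
    "overlap": "two classes plus a common class inheriting from both",
    "disjunction": "two classes plus a new abstract superclass",
}
```

For example, classify_dimension applied to the Doctor and Employee key values of the equality example selects the first entry of RECOVERED.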

(d) part External Key - part Composite Key (pEK-pCK)

The last case illustrates a situation where a primary key and an external key
share some attributes, but not all the attributes as in the previous case. The
recovered relationships in this type of situation are stronger than association
relationships because some of the attributes that identify the tuples of a relation
(the first relation) are in the external key (which is a primary key of another
relation, say the second relation). Formally, if we denote by R1 and R2 the
first and the second relation respectively, and furthermore K_R2 is an external
key of R1, then we have: K_R1 ∩ K_R2 ≠ ∅ and K_R2 − K_R1 ≠ ∅. Only a part
of the external key K_R2 is used in the identification of the data stores of the
relation R1.
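
Read as a test on attribute sets, the condition says that the primary key of the first relation and the external key share at least one attribute, while at least one attribute of the external key lies outside the primary key. A minimal sketch, with a hypothetical function name and hypothetical attribute names in the usage below:

```python
def is_partial_key_case(primary_key, external_key):
    """True if the keys share some attributes and part of the external key
    is not used to identify tuples of the first relation."""
    primary, external = set(primary_key), set(external_key)
    return bool(primary & external) and bool(external - primary)
```

For instance, is_partial_key_case({"WARD_ID", "BED_NO"}, {"WARD_ID", "HOSPITAL_ID"}) holds, whereas an external key that lies entirely inside the primary key, or shares nothing with it, does not satisfy the condition.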

(d-1) Equality dimension 

In order to map the first and second relations, we split the first relation to 
express the fact that just a part of this relation is identified by the external 
key of the second relation. If we assume that a and b are the two partitions of 
the first relation, then the following object-oriented constructs are recovered: 
(i) three classes related to the second relation, and the partitions a and b 
respectively, (ii) an aggregation relationship between the class related to the 
second relation and the class a, and finally (iii) an inheritance relationship 
between a and b. 

Figure 8(a) shows a general mapping based on the relations R1 and R2
introduced in the previous paragraph. In this instance, to map this relation 
correctly, we must first simplify the relation by decomposing the external keys 
of the relations into their constituent elements. 

(d-2) Overlap dimension 

In this situation a primary key and an external key only share some attributes
and some tuples in the dependent or composite relations (first relation) are 
not in the base relation (second relation). This type of property simulates 
a simple and multiple inheritance relationship between classes by means of 
tuple inclusion. In this instance, we extract the following object-oriented con- 
structs: (i) a class, say a, is recovered from the second relation, (ii) a class, say 
b, is recovered from the first relation, however it contains only the attributes 
which are not shared with the second relation, (iii) a common class, say ab, is
recovered due to the overlap dimension, and (iv) a multiple inheritance rela- 
tionship from the class ab to the classes a and b is also recovered. Because of 
the inclusion dependencies between the two relations, we can assume by this 
semantic that there should be another new class created with a simple inher- 
itance relationship between it and classes a and b respectively. Figure 8(b) 
shows the different concepts recovered during this mapping. 




Figure 8 Object Recovery in the Case “pEK-pCK” 

Let us consider an extension of the Medical schema 

Phone (ROOM_ID, PHONE_NO)

HomePhone (PHONE_ID, EMPLOYEE_NO)

with the following properties

HomePhone[PHONE_ID] ∩ Phone[PHONE_ID] ≠ ∅
HomePhone[PHONE_ID] − Phone[PHONE_ID] ≠ ∅
Phone[PHONE_ID] − HomePhone[PHONE_ID] ≠ ∅



then examining these relations with the overlap dimension holding, we can 
see that some instances of HomePhone are not included in Phone and some
instances of Phone are not included in HomePhone. To map this relationship,
we recover: (i) a class HomePhone that will contain the data stores not in the
relation Phone, (ii) a class Phone with the tuples not in the relation HomePhone,
and (iii) a new class Phone-HomePhone containing the tuples in both
the relations Phone and HomePhone. A multiple inheritance relationship is
created between this new class Phone-HomePhone and the classes Phone and
HomePhone. A new class C_new is created with a simple inheritance relationship
between it and the classes Phone and HomePhone.

It is obvious that the names of some of the classes will need to be modified by
the database administrator. For example, the class Phone should be renamed
BusinessPhone, and the class C_new should be renamed Phone. This will enable
the schema to reflect the intended semantics more clearly.



(d-3) Disjunction dimension 

In this situation, a primary key and an external key share only some
attributes, but the tuples in the dependent or composite relations are not in
the base class; this simulates a simple inheritance relationship between classes. In
this instance, we recover the following object-oriented constructs: (i) a class
representing the second relation (i.e. the relation which contains the foreign
key as primary key), (ii) a class representing the first relation, and (iii) because
of the inclusion dependencies between the two relations, we can assume by
this semantic that there should be another new class created with a simple
inheritance relationship between it and the classes generated in (i) and (ii).
Figure 8(c) shows all the recovered concepts from the two relations. 



6 CONCLUSION 

In this paper we presented different possible scenarios to map a relational 
schema into an object-oriented model. The proposed reengineering approach 
is based on the analysis of relational databases both at the schema level and 
the data level. At the schema level, the correlation between relations’ keys is
analysed, and at the data level, the correlation between data stores is checked. 
Combining these two levels of correlation, we can recover hidden semantics 
from relational databases. 

The different cases described in the proposed approach can be implemented 
as operations to support the understanding of the underlying semantics of 
local databases of federated databases. Within the DOK environment, the 
wrapper translation procedures provide an initial object-oriented schema that 
requires, by successive refinements, some improvements in terms of the quality 
of the generated information. This ability is needed to refine or enhance a
generated schema based on an extensive use of object-oriented concepts, such 
as polymorphism. Currently we are providing an approach for refining an 
object-oriented schema, for example by (i) creating polymorphic classes and
(ii) deleting extraneous classes. 

Our future work is concerned with the design of specific mapping proce- 
dures that translate federated queries into local databases. These queries are 
expressed using OQL (Object Query Language) (Cattel et al 1994), translated 
into the OVAL algebra (Savnik et al 1998), and finally translated into local 
sub-queries that are executed using the query specific service. Our approach 
will be using query graphs to perform query translation from OVAL algebra 
to local database query languages, similar to the one proposed in (Meng et 
al 1995). 
Abstract

The ODMG proposal has helped to focus the work on object-oriented data- 
bases (OODBs) onto a common object model and query language. Neverthe- 
less there are several shortcomings of the current proposal stemming from the 
adaption of concepts of object-oriented programming and a lack of formaliza- 
tion. In this paper we present a formalization of the ODMG model and the 
OQL query language that is used in the CROQUE project as a basis for query 
optimization. An essential part is a complete, formally sound type system that 
allows us to reason about the types of intermediate query results and gives 
rise to fully orthogonal queries, including useful extensions of projections and 
set operations. 



1 INTRODUCTION 

For a long time, the evolution of OODB seemed to disperse in quite dif- 
ferent directions: there were rather distinct object-oriented database models 
(OODMs) either based on (nested) relational formalisms or OOPL-like no- 
tions, and hardly any consensus about the structure and formalization of 
queries. Nowadays, researchers and commercial products try to find a com- 
mon language, usually using the notations of the ODMG (Cattell 1996, Cattell 
1997) in order to introduce their specific concepts. It seems that many ideas 
which appeared rather different and contradictory - like OOPL-style versus 
SQL-like programming or value-based databases versus object notions - can 
be put together in order to get a full-fledged OODBS in the future. Neverthe- 
less, several problems remain, some of which are attacked in this paper: 
• OO data models and query languages: Up to now there is quite a big gap
between the advanced OODMs of several research projects and the rather 
simple OODMs used in commercial OODBS. While the latter ones are 
mostly restricted to the underlying OOPL, more advanced models offer nice 
features such as orthogonal subtyping, the explicit distinction of class hier- 
archies and type hierarchies, support for objects and values (without object 
identity), and an orthogonal query language, based on formal (e.g. logic) 
definitions. 

The ODMG model is somewhere in between. While there are some “good- 
ies” such as arbitrarily complex types, and the distinction of values (called 
immutable objects or literals) and objects (also named mutable objects), 
several aspects are not present or clarified until now: 



- Objects are handled like in OOPLs. That is, there is only a type hi- 
erarchy, while reasoning about object collections (“subclassing”) is not 
possible, because an object formally only belongs to one specific class 
(where it was created). Several research projects, however, have shown 
that, in a database context, the ability to arrange object collections in 
an inclusion hierarchy provides a powerful basis for concepts such as 
integrity constraints, views, and derivation rules (Scholl, Laasch, Rich, 
Schek and Tresch 1993, Kuno and Rundensteiner 1996, Ceri and Man- 
they 1993). 

- While the given set of query operations of ODMG-OQL-1.2 seems rather 
complete for practical purposes, there is a lack of a formalization of such 
queries, which is essential for query optimization. In its current form the 
ODMG standard is more like a skeleton for data and query models rather 
than a sound formal model itself. 

- The ODMG standard does not provide any declarative object manip- 
ulation operations, except for object creation. As a consequence, up- 
date transactions completely rely on the OOPL the OODB is bound to 
(e.g. C++). Obviously, the OODBMS can hardly be expected to provide
any help in analyzing transactions, for example to identify conflicts
for semantic concurrency control, to check preservation of integrity con- 
straints, or to derive update propagation rules, if they are coded in an 
imperative language. The standard completely lacks an object manip- 
ulation language (OML), e.g. in the style of update, delete, and insert 
statements of SQL (see also (Laasch and Scholl 1992)). 



• OO optimization and query processing: While relational query processing
has been investigated thoroughly, there are only initial frameworks for OO
query processing and optimization. Up to now, it is not clear how to integrate
these first ideas in order to get efficiency for all ODMG queries. Many
issues are open, especially the interaction of query optimization mechanisms
with useful physical database design choices that need to be offered
for the storage of object databases.

In the CROQUE project*, we are designing and implementing an object 
database system. We use the syntax of the ODMG proposal and added the 
missing formalization. In order to get a clean formal approach, we changed mi- 
nor parts of the original ODMG proposal (Cattell 1996), which are explained 
in detail in the next section. More specifically, we extended our preliminary 
work on the OODB models COCOON (Scholl et al. 1993) and EXTREM 
(Horner and Heuer 1991) in order to cover the concepts of (Cattell 1996). We 
added a general typing theory for mutable and immutable objects which is 
used as a basis of the formalization of the query language. Our rigid type sys- 
tem allows only well-typed ODMG queries and we could extend the applicabil- 
ity of set operations for collections of both mutable and immutable objects. 
Also there is a cast operation (to supertypes) based on the type structure 
which allows a flexible kind of projection of complex structured values. In the 
case of mutable objects, this is similar to “object-preserving” projections of 
other OOQLs. Such a concept is not present in the current ODMG proposal. 

Very recently, the ODMG standard 2.0 (Cattell 1997) was released. We plan 
to integrate the concepts of the new proposal into the CROQUE model. In 
this paper, all references to the ODMG proposal refer to the ODMG standard 
1.2, but we give an overview of how the ODMG 2.0 standard may be combined 
with our proposal in the conclusion. Here we only state that the changes of 
ODMG 2.0 can be integrated into our approach in a somewhat straightforward 
manner. 

The rest of this paper is organized as follows: In the next section we explain 
the differences between our approach and the ODMG proposal. Additionally, 
we give an overview of formalization efforts in the field of object-oriented 
database languages related to the concepts of the ODMG approach. Section 3
presents the formal model (ODL); an analysis of the formalization
of CROQUE-OQL is given in Section 4, while an excerpt of the formalization
of the query language is given in the Appendix. A complete formalization can
be found in (Riedel and Scholl 1996). 



2 COMPARISON AND RELATED WORK 



The ODMG query language is rather rich in concepts, so that the claim of
its inventors that it “can easily be formalized” (Cattell 1996) must be viewed
rather critically. We made the following observations:



* CROQUE (a shorthand for Cost and Rule-based Optimization of object-database 
QUEries) is a joint project of U Konstanz with U Rostock, partially funded by the German 
National Research Fund (DFG).
• Because the concepts of the query language are only introduced by (rather
simple) examples, it is not always clear what the exact intended semantics 
would be for more complex queries. 

• The use of run-time exceptions in the query language seems to be a bad 
design choice for a query language, because further control strategies are 
necessary, when OQL is used within application programs. It seems rather 
odd to allow queries that behave harmfully in certain circumstances. Additionally,
the analysis of the potential for query optimization is rather
difficult, if exceptional cases have to be considered all over the place. 

In order to get a clean formalization, we took a special view of the ODMG
model and query language guided by commonly accepted design principles for 
OOQLs (Atkinson, Bancilhon, DeWitt, Dittrich, Maier and Zdonik 1989, Yu 
and Osborn 1991, Heuer and Scholl 1991): 

• The CROQUE data model is based on the ideas of Beeri’s OODB model 
(Beeri 1990) and more directly on COCOON and EXTREM. It can handle 
values (immutable objects) as well as objects, and orthogonal type con- 
structors (tuples, sets, bags, lists, and arrays) are supported. More specifi- 
cally, a rigorous, strict type system for mutable and immutable objects is 
the basis for a closed query language where each query result is a part of 
the data model. 

• The query language is statically typed. Thus, queries can be type-checked 
at compile-time. Only well-typed queries are allowed, so that some ODMG 
queries are excluded in our approach. But the intention of those queries can 
easily be achieved using a different syntactical form. Furthermore we had 
to introduce null values in query results to capture the element operator 
applied to non-singletons. 

• In CROQUE-OQL, as far as presented in this paper, the treatment of 
methods in queries is incomplete. A method call in a query is currently 
treated like an access to an attribute, thus we only check whether it is 
well-typed using the presented type system, and thus make sure that there 
are no runtime exceptions. This is enough for our initial purpose, namely 
building a query optimizer for declarative OQL queries. A more complete 
treatment of methods is part of our future work. 

• The ODMG proposal is rather informal for set operations on complex val- 
ues (structured immutable objects) and object types. Here, CROQUE pro- 
vides orthogonal use of union, intersection, and set difference for arbitrary 
collections of objects and for all literals where the types have a common 
supertype (union and except) or a common subtype (intersect). 
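
The role of common supertypes and subtypes in the last item can be sketched for struct-valued literals. The sketch assumes record-style subtyping, in which a struct type with additional fields is a subtype of one with fewer fields; this excerpt does not spell out the subtype relation, so both this assumption and all names below are illustrative only.

```python
# A struct type is modelled as a dict from field label to type name.
# Assumption: a struct type with more fields is a subtype of one with fewer.


def common_supertype(t1, t2):
    """Fields that both types share with the same field type."""
    return {f: t1[f] for f in t1 if t2.get(f) == t1[f]}


def common_subtype(t1, t2):
    """All fields of both types, or None if a shared field has conflicting types."""
    if any(t1[f] != t2[f] for f in t1.keys() & t2.keys()):
        return None
    return {**t1, **t2}


def cast(value, target_type):
    """Project a struct value onto the fields of a supertype."""
    return {f: value[f] for f in target_type}


def typed_union(xs, t1, ys, t2):
    """Union of two collections of struct values, evaluated at the common supertype."""
    target = common_supertype(t1, t2)
    result = []
    for value in [cast(x, target) for x in xs] + [cast(y, target) for y in ys]:
        if value not in result:  # set semantics: drop duplicates
            result.append(value)
    return target, result
```

With hypothetical types WAITER = {"name": "string", "salary": "int"} and GUEST = {"name": "string", "table": "int"}, a union is evaluated at {"name": "string"}, while an intersection would be typed at the common subtype carrying all three fields.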

While the work presented in this paper covers the OQL language description 
as given in (Cattell 1996), the following open issues have to be worked out in 
more detail in the future: 
• Although a nil value is mentioned in (Cattell 1996), it is open how nulls
should be supported in OQL. Due to a lot of detail problems, we just took 
a simple approach by using a null just as another value of the specific 
domain. In our approach, selections only result in objects or values where 
the conditions evaluate to true. Thus, nulls are treated similarly to false in
this case. In other words, we return the true-result only, not the
maybe-result (Biskup 1983). We mention that nulls are not newly introduced by
queries, iff the database is null-free and null constants and the element 
operator are not present in the query. 

The main problem of a full-fledged treatment of nulls in OOQLs (and OQL 
in particular) is the interpretation of the database and the query results. It 
has to be clarified how null values can be used in the database and how the 
detail problems for the OQL clauses should be treated, especially when, by 
the use of query operations, null values migrate to object-valued attributes 
or represent the object identity. 

• A major effort for the future will be the integration of ODMG-OQL and 
the upcoming SQL-3 standard. In (Cattell 1996) a rather simplistic ap- 
proach is described, where SQL clauses can be realized as macros within 
the ODMG-OQL framework. It is not clear, though, how this may work 
together (efficiently) with more enhanced SQL statements, such as outer 
joins or the multiple use of aggregations in the select clause of
select-from-where blocks. Also a lot of detail problems have to be solved when
the ODMG proposal has to be integrated into the upcoming SQL/Object 
specification (Melton 1996). 
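
The simple treatment of nulls in selections described in the first item above can be sketched as follows. Python's None stands in for the null value, and the helper names and the sample data are hypothetical; the point is only that a condition evaluating to "unknown" is handled like false, so that the selection returns the true-result and never the maybe-result.

```python
NULL = None  # stands for the null value of any domain


def less_than(a, b):
    """Comparison that yields None (unknown) when either operand is null."""
    if a is NULL or b is NULL:
        return None
    return a < b


def select(collection, predicate):
    """Keep only the elements for which the predicate is definitely true."""
    return [x for x in collection if predicate(x) is True]


# Hypothetical dishes of a restaurant menu; one price is unknown.
DISHES = [
    {"name": "soup", "price": 6},
    {"name": "special", "price": NULL},
    {"name": "steak", "price": 25},
]

cheap = select(DISHES, lambda dish: less_than(dish["price"], 10))
```

Here cheap contains only the soup: the dish with the unknown price belongs to the maybe-result and is dropped, exactly like the dish whose price fails the condition.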



Over the last years, quite a few attempts have been published that aim 
at providing clean formalizations for OQL, for instance (Bancilhon and Kho- 
shafian 1989, Abiteboul and Kanellakis 1989, Straube 1991, Bertino, Negri, 
Pelagatti and Sbattella 1992, Kifer, Kim and Sagiv 1992, Kifer, Lausen and 
Wu 1993, Hohenstein and Engels 1992, Herzig and Gogolla 1994, Kamel, Wu 
and Su 1994, Fegaras and Maier 1995). However, only a few of these integrate
all concepts of the ODMG proposal into a single formal model. Usually they 
use a calculus-based or algebra-based paradigm as the basis of formalization, 
which results in covering only a subset of the OQL concepts. It is not clear 
whether they can be extended easily to incorporate the full spectrum. Here 
we comment on some of the research work in more detail. 

• Early approaches like Straube’s calculus and algebra (Straube and Ozsu 
1990, Straube 1991) are quite simple extensions of the relational approach, 
where the complex structures of the ODMG proposal can neither be built 
within the data model nor in the query language. On the other hand, it
was possible to derive results for the equivalence of calculus and algebra,
simplification rules, and optimized evaluation plans. Nevertheless, due to
its simplicity, it seems that this approach is not appropriate for the ODMG
model. 

• An example of a rather complex object algebra is the AQUA algebra 
(Leung, Mitchell, Subramanian, Zdonik and other 1993) proposed for the 
EREQ project. Here several operators are defined in a functional way to 
capture different meanings of set operations. Because it explicitly handles 
different kinds of collections by different algebra operators, it is more suitable
as an “internal” algebra for query processing, while OQL is more
of an interface language, where the evaluation problems are described
on a different level. In fact, the EREQ project pursues quite similar goals 
as CROQUE. We, however, try to avoid the “explosion” of algebraic op- 
erators due to different collection types by using monoid comprehensions 
(Grust and Scholl 1996). 

• Among different SQL-based extensions, XSQL (Kifer et al. 1992) uses F- 
logic (Kifer et al. 1993) to get a formal semantics. Therefore, the data 
model is more restricted than the ODMG approach, because the underlying 
data model only supports one-level set values. On the other hand, queries
can be defined as subclasses of other classes and path expressions are more 
flexible than in the ODMG-OQL approach. Also the well-typed application 
of method calls and the access to the meta-level are supported. 

• Some formal approaches (like (Abiteboul and Kanellakis 1989, Kifer et al. 
1993, Fegaras and Maier 1995)) can be seen as extensions of nested rela- 
tional calculus. Usually, the clean treatment of lists, bags, and arrays within 
such languages is rather difficult, because the access and construction of 
such values is not as declarative as for sets. Also query languages for struc- 
tured types face the problem that they have to be restricted syntactically 
in order to get a first order query language (Beeri 1989). 

• A calculus-based formalization of SQL-like queries is given in (Hohenstein 
and Engels 1992) in the context of an extended ER model. The work is 
useful within the ODMG context for the calculus-based parts of OQL, like 
the select-from-where block. A similar approach (Herzig and Gogolla 
1994) also treats arbitrarily structured objects, but both approaches do 
not explicitly support the general type system. 

• A recent approach to integrate complex types into set expressions is done 
by Fegaras and Maier (Fegaras and Maier 1995) using so-called monoid 
comprehensions. This allows a generic argumentation on operations of dif- 
ferent types. In contrast to the approach of this paper, the typing of the 
queries has to be done explicitly in the query expressions. Nevertheless, 
monoid comprehensions are rather useful to get a basis for OO query optimization
and are also exploited in the CROQUE project (Grust and Scholl
1996, Grust, Kroger, Gluche, Heuer and Scholl 1997). 
3 THE DATA MODEL AND ODL

This section presents the CROQUE approach to define a formal semantics of 
an OQL. We show the flavor of CROQUE and the differences to ODMG by 
the running example of the paper. The example database stores information 
about restaurants, their menus and employees. A graphical notation of its type 
structure is given in Figure 1, where we extended the graphical notations of 
ODMG slightly in order to capture the complex types directly. 

The ODMG model is a type-based object model, where information about
object collections need not be present in the schema. In the figure above, 
each box represents an object type with its attributes. Simple arrows point 
to the type of component objects, while the thick arrow denotes the subtype 
relationship. Possible collection types are sets, bags, lists, and arrays. 

Our formalization builds upon the BCOOL model presented in (Laasch 
and Scholl 1993a). While our primary goal is to formalize the ODMG object 
model, we took the freedom to modify the model in the following two major 
respects: 

1. In CROQUE, mutable objects are atomic. 

The replication of large parts of the ODMG (meta-) type system accord- 
ing to the distinction of mutable and immutable structured objects seems 
unnecessary to us. We rather adopt the common understanding that all 
structured “objects” are literals (i.e., structured values). ODMG’s structured
mutable objects would be represented as an atomic (mutable) object
with one property that in turn contains the (immutable) structure in the 
CROQUE model. While this simplification might be debatable from an 
OOPL point of view, in our context of a semantical object model it is only 
a minor issue. 

2. In CROQUE, objects can be an instance of multiple types (at the same 
time, and — by means of gain/lose operations — throughout their lifetime). 
Neither feature is included in the ODMG proposal, but both are mentioned
as planned for the final release. In order to provide more flexible
(object-preserving) query functionality, we added this from the beginning 
(see also (Laasch and Scholl 1993a, Schek and Scholl 1990)). 
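
Both modifications can be pictured with a small sketch: a mutable object is reduced to an identifier, the structure it carries for each of its types is an immutable value, and types are acquired and dropped over the object's lifetime. The class and method names below (MutableObject, gain, lose, active_domain) are hypothetical and chosen for this illustration; they are not CROQUE syntax.

```python
from types import MappingProxyType


class MutableObject:
    """Atomic mutable object: an identifier plus one immutable value per type."""

    def __init__(self, oid):
        self.oid = oid
        self._values = {}  # type name -> immutable structured value

    def gain(self, type_name, **fields):
        """Make the object an instance of a further type."""
        self._values[type_name] = MappingProxyType(dict(fields))

    def lose(self, type_name):
        """Remove the object from a type it currently belongs to."""
        self._values.pop(type_name, None)

    def is_instance_of(self, type_name):
        return type_name in self._values

    def value(self, type_name):
        return self._values[type_name]


def active_domain(objects, type_name):
    """Identifiers of all objects that currently are instances of the type."""
    return {obj.oid for obj in objects if obj.is_instance_of(type_name)}
```

An object can thus be in the active domain of, say, a Person type and an Employee type at the same time, and leave the latter again without losing its identity.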



Furthermore, we adopt the BCOOL approach to arrange object types into 
a lattice (as opposed to just an arbitrary (multiple) hierarchy), such that 
(object-preserving) projections (“casts” in the OQL terminology) need not be 
restricted to named supertypes present in the ODB schema. ODL types are 
formalized as follows: there is one basic sort for mutable objects (the domain of 
object identifiers). Our type system builds a lattice of (named and unnamed) 
object types below this basic sort. Each ODL-object type defines a named 
type by the set of “characteristics” (attributes and relationships, operations). 
ODL characteristics are formalized as functions (attributes are literal-valued
functions, relationships are object-valued, possibly multi-valued, functions).
ODL operations represent methods, that is, computed properties (no side- 
effects) and update operations (with side-effects). 

In the sequel, we define a formal type system for the CROQUE-version 
of ODL. Basic types describe (pairwise) disjoint sets of instances. There are 
several basic types for the atomic literals ($\delta_{int}$, $\delta_{bool}$, …), plus one basic
type for mutable objects ($\delta_{object}$), on which object types can be defined by
subtyping (see below). Structured types can be specified using the built-in type
constructors set ($\{\,\}$), bag ($\{\!|\,|\!\}$), list ($\langle\,\rangle$), array ($[\,]$), struct ($(\,)$), and function
($\to$). Types serve several purposes: (i) they represent a “repository” of possible
values (this will be called the domain of the type below, an intensional notion 
of type); (ii) they are used by the compiler for type checking (i.e., assuring that 
only “(type) valid” expressions are ever executed). For example, we would not
allow the square root of a string to be computed. Finally, (iii) types can be used
as containers (collections) for those values of that type, which are currently 
“in use” in the database (in ODL: “with extension”). The latter use of types 
(extensions) is typically only common with mutable types (where we will talk 
about the “active domain” of the type). Notice that, on the formal level, we 
will use active domains regardless of whether the ODB schema contains the 
“with extension” clause or not. 

We use a denotational approach to specify the formal semantics. A formal 
language is defined syntactically, typing rules and semantics are then defined 
for that language. As usual in the denotational approach, semantics are defined
by giving denotation functions $[\![\,\ldots\,]\!]$ that map syntactic constructs of the
defined language to (operations on) the so-called “semantic domains”. OQL’s 
mapping to the formal language is straightforward and illustrated in examples 
throughout this paper. 

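Syntax. The syntax of a formal CROQUE type expression $\tau$ is given by
the following list:

$$
\begin{array}{rcll}
\tau & = & \delta_{int} & \text{/* INTEGER */} \\
 & \mid & \delta_{bool} & \text{/* BOOLEAN */} \\
 & \mid & \ldots & \\
 & \mid & \delta_{object} & \text{/* mutable object sort */} \\
 & \mid & [f_1, \ldots, f_n] & \text{/* object types */} \\
 & \mid & \{\tau\} & \text{/* set types */} \\
 & \mid & \{\!|\tau|\!\} & \text{/* bag types */} \\
 & \mid & \langle\tau\rangle & \text{/* list types */} \\
 & \mid & \tau[\,] & \text{/* array types */} \\
 & \mid & (\tau_1, \ldots, \tau_n) & \text{/* struct types */} \\
 & \mid & \delta_{object} \to \tau & \text{/* function types */}
\end{array}
$$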



Semantics. A type may be referenced by a label (e.g. [name, age, children]
by person). Labelling is mandatory for structs but optional for other structured
types. Therefore, we decide to define the semantics of a struct based on
the given label (name equivalence), while other types are defined by their internal
structure (structural equivalence). We use a set $\mathcal{LABELS}$ as the name
space for such labels.

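The semantic domain of values is defined by the following recursive domain
equations:

$$
\begin{aligned}
V &= V_{int} \cup V_{bool} \cup V_{string} \cup V_{object} \cup \mathcal{F} \cup \mathcal{S} \cup \mathcal{B} \cup \mathcal{L} \cup \mathcal{A} \cup \mathcal{T}, \\
V_{bool} &= \{\bot_{bool}, true, false\}, \\
V_{int} &= \{\bot_{int}, 0, 1, -1, 2, \ldots\}, \\
V_{string} &= \{\bot_{string}, \text{'a'}, \ldots, \text{'A'}, \ldots\}, \\
V_{object} &\ \text{contains countably infinitely many objects}, \\
\mathcal{F} &= V_{object} \to_{fin} V, \\
\mathcal{S} &= \mathcal{P}_{fin}(V), \\
\mathcal{B} &= V \to_{fin} \{0, 1, 2, \ldots\}, \\
\mathcal{L} &= \{0, 1, 2, \ldots\} \to_{fin} V, \\
\mathcal{A} &= \{0, 1, 2, \ldots\} \to_{fin} V, \\
\mathcal{T} &= \mathcal{LABELS} \to_{fin} V.
\end{aligned}
$$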

$V_i$ are domains of basic values (e.g., boolean and integer). $\mathcal{F}$ denotes the
domain of finite mappings from $V_{object}$ to $V$, and $\mathcal{S}$ the set of all finite
subsets of $V$. The domains for the other constructed types ($\mathcal{T}$ for structs,
$\mathcal{L}$ for lists, $\mathcal{B}$ for bags, and $\mathcal{A}$ for arrays) are modelled as (finite) function
domains mapping index values to elements (structs, arrays, and lists) and
elements to positive integers (bags), respectively. Defining arrays like lists is
sufficient for our purposes and avoids some well-known problems with the
formalization of functional arrays (Libkin, Machlin and Wong 1996). The
type-specific bottom elements ($\bot_i$) denote undefined values. To improve
readability, we omit the type information and write $\bot$ instead in the sequel.
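As an illustration of these function domains, the following sketch represents bags, lists, and structs as finite Python mappings. The helper names `as_bag`, `as_list`, and `as_struct` are hypothetical, and bottom elements are left out for brevity.

```python
from collections import Counter


def as_bag(elements):
    """A bag as a finite map from elements to their positive multiplicities."""
    return dict(Counter(elements))


def as_list(elements):
    """A list (or array) as a finite map from indices 0, 1, 2, ... to elements."""
    return dict(enumerate(elements))


def as_struct(**fields):
    """A struct as a finite map from labels to values."""
    return dict(fields)


bag = as_bag(["a", "b", "a"])       # {"a": 2, "b": 1}
lst = as_list(["x", "y"])           # {0: "x", 1: "y"}
rec = as_struct(name="Ann", age=30)
```

With this representation, bags are insensitive to element order (`as_bag(["a", "b"]) == as_bag(["b", "a"])`), while lists are not, because the index is part of the mapping.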

In general, equality must be defined for types on which sets are constructed
(e.g., for testing set-membership). Because the equality for function types is
undecidable in general, the domains of function types are restricted to objects.
The equality on these restricted functions would still be undecidable, because
the domain $V_{object}$ is infinite. However, since all instances of functions that
can ever occur in any database state are restricted to the active domains of
the corresponding object types (which are finite sets), all functions can be
regarded as finite sets of pairs, such that equality is decidable. Hence, we do
not need to separate types with equality from those without equality, which
would be necessary otherwise.
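The argument can be made concrete with a short sketch: once a function is restricted to a finite active domain, it becomes a finite set of (argument, result) pairs, and two such sets can be compared directly. The names `graph` and `equal_on_active_domain` and the string object identifiers are hypothetical, and results are assumed to be hashable.

```python
def graph(f, active_domain):
    """The finite set of (argument, result) pairs of f over the active domain."""
    return frozenset((x, f(x)) for x in active_domain)


def equal_on_active_domain(f, g, active_domain):
    """Decide equality of f and g by comparing their finite graphs."""
    return graph(f, active_domain) == graph(g, active_domain)


# Two "age" functions that agree on the objects o1 and o2 only.
ages_f = {"o1": 30, "o2": 41}
ages_g = {"o1": 30, "o2": 41, "o3": 7}

agree_small = equal_on_active_domain(ages_f.get, ages_g.get, {"o1", "o2"})
agree_large = equal_on_active_domain(ages_f.get, ages_g.get, {"o1", "o2", "o3"})
```

Here `agree_small` is `True` and `agree_large` is `False`: the answer depends on the active domain, which is finite in every database state, so the comparison always terminates.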

Basic and Constructed Types. Except for object types, our semantics
of types and subtyping is quite usual and follows (Balsters and de Vreeze
1991, Balsters and Fokkinga 1991, Mannino, Choi and Batory 1990); i.e., the
denotations ($[\![\ldots]\!]$) of basic types are given by the following equations:

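Definition (Semantic Domain):

$$
\begin{aligned}
[\![\delta_i]\!] &= V_i \quad \text{in case that } V_i \text{ is a summand of } V \\
[\![\tau_1 \to \tau_2]\!] &= \{ f \in \mathcal{F} \mid x \in [\![\tau_1]\!] \Rightarrow f(x) \in [\![\tau_2]\!] \} \\
[\![\{\tau\}]\!] &= \{ x \in \mathcal{S} \mid x \subseteq [\![\tau]\!] \} \\
[\![\{\!|\tau|\!\}]\!] &= \{ f \mid f : [\![\tau]\!] \to_{fin} \{0, 1, 2, \ldots\} \} \\
[\![\langle \tau \rangle]\!] &= \{ f \mid f : \{0, 1, 2, \ldots\} \to_{fin} [\![\tau]\!] \} \\
[\![\tau[\,]]\!] &= \{ f \mid f : \{0, 1, 2, \ldots\} \to_{fin} [\![\tau]\!] \} \\
[\![\langle L_1 : \tau_1, \ldots, L_n : \tau_n \rangle]\!] &= \{ f : \{L_1, L_2, \ldots, L_n\} \to \textstyle\bigcup_i [\![\tau_i]\!] \mid f(L_j) \in [\![\tau_j]\!] \ (j = 1, \ldots, n) \}
\end{aligned}
$$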
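The definition can be read as a membership test: a value belongs to the denotation of a type if it satisfies the corresponding set comprehension. The sketch below checks this for a fragment of the type language. It is an illustration under simplifying assumptions: types are encoded as hypothetical tagged tuples, bottom elements and function types are omitted, and the function name `inhabits` is not taken from the paper.

```python
def inhabits(v, t):
    """Return True if value v lies in the denotation of type t."""
    kind = t[0]
    if kind == "int":
        return isinstance(v, int) and not isinstance(v, bool)
    if kind == "bool":
        return isinstance(v, bool)
    if kind == "string":
        return isinstance(v, str)
    if kind == "set":
        # a finite set whose elements all lie in the element type
        return isinstance(v, frozenset) and all(inhabits(x, t[1]) for x in v)
    if kind == "bag":
        # a finite map from elements of the element type to multiplicities
        return isinstance(v, dict) and all(
            inhabits(x, t[1]) and isinstance(n, int) and n >= 0
            for x, n in v.items()
        )
    if kind in ("list", "array"):
        # a finite map from indices 0, 1, 2, ... to elements
        return isinstance(v, dict) and all(
            isinstance(i, int) and i >= 0 and inhabits(x, t[1])
            for i, x in v.items()
        )
    if kind == "struct":
        # a map defined on exactly the labels, with f(L_j) in the j-th type
        fields = t[1]
        return (
            isinstance(v, dict)
            and set(v) == set(fields)
            and all(inhabits(v[label], fields[label]) for label in fields)
        )
    raise ValueError(f"unsupported type expression: {t!r}")


person = ("struct", {
    "name": ("string",),
    "age": ("int",),
    "nicknames": ("set", ("string",)),
})
ok = inhabits({"name": "Ann", "age": 30, "nicknames": frozenset({"A"})}, person)
bad = inhabits({"name": "Ann", "age": "30", "nicknames": frozenset()}, person)
```

In this example `ok` is `True`, and `bad` is `False` because the value stored under `age` is a string rather than an integer.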

Subtyping is used to describe (sub)sets of objects with common interfaces,
such that type-checking becomes more meaningful. The CROQUE definition
of a subtype consists of three parts: a set of supertypes, a set of local
characteristics, and (possibly) a type name. Any instance of the subtype is also an
instance of its supertypes (substitutability), and all characteristics defined on
the supertypes are applicable to the instances of the subtype (inheritance of
the interface), in addition to the locally defined ones. Formally, object types
need not be named; they are given by listing the set of characteristics.* For
example, if person is an object type with attributes name, age, and a
(set-valued) relationship children, and cook is a subtype of person with the additional
attributes title and specialities, the two object types will be referred to as



* Notice that the ODMG policy of solving naming conflicts due to multiple inheritance by 
renaming leads to unique characteristics’ names, and hence types are uniquely identified by 
the set of characteristics. 





